Preface


Living systems display highly complex dynamic behaviors and can demonstrate self-organizing
emergent properties for better survival. Thus, traditional highly reductionist or wet bench
approaches alone are insufficient. Systems Biology has emerged as a new
field integrating biological experimentation with theoretical concepts adopted from physics, 
mathematics, statistics, and computer science. More recently, computational modeling, 
artificial intelligence, and data analytics are gaining acceptance for predictive biological 
research. 

For metabolic engineering and synthetic biology, the use of cross-disciplinary techni- 
ques to analyze complex and high-throughput biological data is now widely adopted for 
designing and producing microbes for optimization of valuable specific yields. For support- 
ing disease investigation and interventions, computational and machine learning approaches 
are also gaining traction. 

In this book, I have carefully drawn together individual chapters that provide protocols for
computational, statistical, and machine learning methods, applied largely for metabolic
engineering and synthetic biology, and two chapters for disease applications. The authors
are well-established scientists in their respective fields. These approaches will support the
current progress in cross-disciplinary research that is widely discussed and explored as the 
next step for integrating the different scales of biological complexity. 

Geared toward researchers with limited molecular engineering and computational 
analytical or modeling experience, the book provides a broad overview of the subject and 
detailed instructions in computational and machine learning approaches. The text is written
in a simple technical manner as an introduction for physicists, chemists, computer scientists,
and biologists who are interested in understanding how basic to advanced computational
biology and machine learning methods are adopted for metabolic engineering, synthetic
biology, and disease modeling research.

Chapter 1 by Daboussi and Lindley begins with the general overview of the field of 
metabolic engineering and the challenges facing industrial biotechnology applications, 
especially for high-value products. In the following chapter, Chang and colleagues provide
a brief historical background to synthetic biology and go on to succinctly summarize
the recent machine learning activities as applied to diverse synthetic biology applications.

In Chap. 3, Gilliot and Gorochowski report a computational model that can analyze and
predict massively parallel reporter assay (MPRA) experiments, based on fluorescence-activated
cell sorting procedures. Andre and colleagues, in Chap. 4, provide computational
methods to investigate and build molecular assemblies of proteins, which can then be used 
to predict structural models of the protein partners and, using coevolution information, 
search for interacting regions. Smith introduces a new bioinformatics pipeline of genome 
mining to structural protein engineering in Chap. 5. 

Next, Logel and Jaschke, in Chap. 6, provide a new workflow with hidden Markov
models to create synthetic overlaps between two proteins, thereby protecting the engineered
coding sequences from mutation or loss of function. This is followed by a Boolean logic
model demonstration of convergent promoters for synthetic biology applications by Abraha
and Marchisio (Chap. 7). A fundamentally similar digital approach was taken by Guizio and
colleagues for the design of recombinase logic circuits in Chap. 8.

Focusing on lab to industrial scale bioprocessing, Yeoh and Poh describe an integrated
modeling workflow that combines experimental datasets with a kinetic modeling approach and
computational fluid dynamics (Chap. 9). Following this, Collins and colleagues present a kinetic
model using spatial information to study the subcellular recruitment of optogenetic proteins to
the plasma membrane for synthetic biology applications (Chap. 10). For engineering the microbe
Vibrio natriegens to produce 1,3-propanediol, Zhen and colleagues show genome-scale
metabolic models and genome editing protocols as crucial workflows in Chap. 11.

In Chap. 12, Helmy and Selvarajoo present a pipeline for rigorous transcriptomics data
analytics for synthetic biology applications, while Gendoo demonstrates a suite of bioinformatics
software and databases that are very useful for metabolic engineering (Chap. 13).
Sugimoto presents a three-dimensional mathematical model and protocol, in Chap. 14, that
incorporates the sprouting and branching events in angiogenesis and tumor growth in
cancers.

The final five chapters, Chapters 15, 16, 17, 18, and 19, describe machine learning
methods for diverse applications ranging from metabolic pathway (Bonetta and colleagues,
Cuperlovic-Culf and colleagues), omics (Niranjan and colleagues), and disease (Occhipinti
and colleagues) analyses to the elucidation of protein interaction networks (Sundar and
colleagues).

I believe the chapters presented in the book will be useful for all readers to grasp the 
general trend of modern computational methods applied to understand and predict com- 
plex biology. 


Kumar Selvarajoo
Bioinformatics Institute, Agency for Science, Technology & Research,
Singapore, Singapore
Challenges to Ensure a Better Translation of Metabolic
Engineering for Industrial Applications 


Fayza Daboussi and Nic D. Lindley 


Abstract 


Metabolic engineering has evolved towards creating cell factories with increasingly complex pathways as 
economic criteria push biotechnology to higher value products to provide a sustainable source of speciality 
chemicals. Optimization of such pathways often requires high combinatory exploration of best pathway 
balance, and this has led to increasing use of high-throughput automated strain construction platforms or 
novel optimization techniques. In addition, the low catalytic efficiency of such pathways has shifted 
emphasis from gene expression strategies towards novel protein engineering to increase specific activity of 
the enzymes involved so as to limit the metabolic burden associated with excessively high pressure on 
ribosomal machinery when using massive overexpression systems. Metabolic burden is now generally 
recognized as a major hurdle to be overcome with consequences on genetic stability but also on the 
intensified performance needed industrially to attain the economic targets for successful product launch. 
Increasing awareness of the need to integrate novel genetic information into specific sites within the 
genome which not only enhance genetic stability (safe harbors) but also enable maximum expression 
profiles has led to genome-wide assessment of best integration sites, and bioinformatics will facilitate the 
identification of most probable landing pads within the genome. 

To facilitate the transfer of novel biotechnological potential to industrial-scale production, more atten- 
tion, however, has to be paid to engineering metabolic fitness adapted to the specific stress conditions 
inherent to large-scale fermentation and the inevitable heterogeneity that will occur due to mass transfer 
limitations and the resulting deviation away from ideal conditions as seen in laboratory-scale validation of 
the engineered cells. To ensure smooth and rapid transfer of novel cell lines to industry with an accelerated
passage through scale-up, better coordination is required from the outset between the biochemical
engineers involved in process technology and the genetic engineers building the new strain so as to have
an overall strategy able to maximize innovation at all levels. This should be one of our key objectives when
building fermentation-friendly chassis organisms.


Key words Cell factory, Industrial fermentation, Genetic stability, Pathway optimization, Biotechnol- 
ogy, Specialty chemicals 


1. Introduction 


When contacted to prepare this chapter, our initial thoughts were
to prepare an overview of all the positive advances that have been
made to generate new engineered microbes. However, this is a
mammoth undertaking and would probably justify an entire book
rather than a simple introductory chapter. Indeed, many of the
other chapters will deal with just such updates and showcase some
of the successful achievements made to date. Instead we have tried
to look at why much of the extraordinary potential of metabolic
engineering has not always been successfully translated to industry
and how the challenges have evolved as metabolic targets shift from
relatively simple molecules to more complex high-value metabolites
which we believe are going to be increasingly important. Indeed, a
recent opinion paper [1] gave some examples of current success
stories arising from the synthetic biology approach to metabolic
engineering and some ideas as to what the future might hold. A
few years back, we might optimistically have stated that “the sky is
the limit,” but some of the ideas go beyond this stratospheric
ceiling and deal with concepts that could facilitate sustainable
space travel. Despite these success stories and a growing number
of products coming out of the pipeline, many potentially exciting
strains continue to struggle to bridge the gap in key titer, rate, and
yield (TRY) levels between feasibility studies in the research
laboratory and industry. We will try to identify where some of the
current bottlenecks are situated in driving such studies to industrial
exploitation as this needs to be better understood by those working
outside industry. Obviously this overview cannot be complete and
serves more to illustrate some areas which the growing synthetic
biology community might want to invest in to accelerate the
translation of promising research into realistic industrial
exploitation. As the world awaits sustainable solutions to meet the
requirements of consumers, an unprecedented opportunity awaits
us, and we have to ensure that the impressive accumulation of new
knowledge can be translated into biotechnological answers to this
demand.


1.1 Historical Perspective


Synthetic biology is quite a vast domain covering a whole variety of
topics, but we have deliberately restricted ourselves to what used to be
covered by the term metabolic engineering, that is, the rational
engineering of the microbial genome to generate high-performance
microbes which fall into the cell factory concept for economically
interesting conversion of simple feedstock into desirable metabolites,
an approach initiated more than 40 years ago [2, 3]. As many of
us active in this domain appreciate, the rational adjective is often the 
limit to this statement and often a weak spot in metabolic engineer- 
ing which has turned to alternative high-throughput logic to 
overcome the incomplete understanding of how microbes function 
[4, 5] and more precisely how such functionality will be modified 
by adding new biochemical pathways [6] to the existing metabolic 
network. Much of the early progress in metabolic engineering 
covered the upgrading of natural pathways to enable efficient 
accumulation of the desired product using genetic tools to modify
the pathway composition and regulation but fundamentally was
often optimizing pathways that already existed in the cells or
needed relatively modest additions to divert a pathway to an
alternative product [7-9]. This led to a significant number of
applications covering simple molecules such as amino acids,
vitamins, organic acids, and solvents as well as some of the key
biopolymers. Details of how this was achieved can be found in
any of the reviews which have appeared on a regular basis. One
of the key problems to be faced was conversion efficiency, because
chemical synthesis from fossil fuels was often easier to industrialize,
leading to the situation in which profitability was dependent on the
relative cost of petrochemical and sugar feedstocks and the
conversion factors involved. The search for cheap feedstocks has also
helped drive this push and avoids competition with food requirements
for the classical fermentation substrates.

Recently, many of these metabolic engineering strategies have 
hit economic difficulties despite some excellent scientific progress 
being made. While this varies from one application domain to another,
the current situation does not favor the biotechnological production
of simple bulk chemicals except when the biological process
offers some key advantages such as avoiding toxic waste products.
The strong competition from chemical synthesis using fossil 
resources has, however, generated very strong pressure to improve 
yields and final concentrations, and we now have an increasing list 
of molecules that can be produced with yields close to the theoreti- 
cal maxima and in concentrations often exceeding 100 g/L. 
Despite this we still suffer in process competitiveness, though this
could of course change in the coming years as climate change
criteria become increasingly important in a carbon-neutral vision of
the world’s industrial economy. To put this into perspective,
however, you have to take into account that a huge majority of 
petrochemicals are used for energy production (about 75% of out- 
put) in one form or another, and only a relatively small amount is 
used for a highly diverse list of chemical synthons used to produce 
several thousand products [10]. Replacing any single product 
outside the energy domain is therefore unlikely to have a huge 
effect on climate considerations. 

The metabolic engineering community has switched its logic to 
react to this situation. Some moved towards the domain of systems 
biology aware that the underpinning knowledge base was often a 
limiting factor in the pragmatic engineering strategies that had 
attracted them initially to the metabolic engineering domain 
[6]. In parallel, we also saw a consolidated movement towards 
high-throughput automated platforms (biofoundries) coming 
into play [11] so that the rational knowledge-based optimization 
could effectively be replaced or complemented by a more empirical
investigation of the unknown phenomena that remained obscure 
and hence difficult to engineer. We have seen increasingly sophisticated
platforms which can upgrade our strain engineering capacity
quite remarkably so that today we can explore multiple possibilities 
in a timeline which was previously not even a possibility in our 
wildest dreams. The challenge here is not only to capture the best 
constructs which match with our application-driven targets but also 
to capture the wealth of information that is hidden in the strains 
which do not perform as we would hope. This is typified by the 
impressive facilities available to some of the private synthetic 
biology companies exploring new product development such as 
Amyris, Zymergen, and Ginkgo Bioworks, which have integrated
advanced bioinformatics with fully automated strain construction 
platforms with the capacity to rapidly optimize novel metabolic 
pathways, though this development is restricted to a relatively 
small number of workhorse chassis organisms. Coupled to some 
of the machine learning technology, this abundance of data will 
point us more rationally to some of the engineering which is today 
still rather empirical and help build a systems biology knowledge 
base to better predict most probable strain constructs 
[12, 13]. How we move in that direction will depend probably on 
our capacity to integrate automatic data handling and metabolic 
modeling tools to extract the hidden meaning from our “failed” 
constructs. 

Accompanying these changes was a movement towards 
engineering strains able to produce more complex secondary meta- 
bolites with high added value. The appearance of the term synthetic 
biology acknowledged the fact that today we are not simply 
engineering known established pathways but actually looking at 
novel artificial pathways and entirely novel regulatory control 
mechanisms in what we might term synthetic biodiversity. This 
shift towards more complex biochemical pathways of course brings 
its own specific problems and some new paradigms to be resolved 
which will ultimately need those of us working in this domain to 
expand our toolbox to meet these challenges. Many of the 
references cited above cover some of these aspects. At a moment 
where the industrial world is awaiting this innovation, we need to 
be sure that what we are developing in the laboratory can be 
translated to the world of industrial fermentation in which the 
microbes are often subject to quite extreme growth conditions 
and resulting metabolic stress building on the specific stress of 
having their metabolic topology modified quite considerably. All 
too often our highly tuned and sensitive microbes fail to perform 
when placed in the industrial context, and this is harmful for our 
credibility as few domains are so closely tied to the application 
potential as metabolic engineering. 
We have the expertise, but maybe we need to be selective in
how we use this toolbox to generate strains which are more robust
and able to deal with the consequences of metabolic burden
[14]. Are synthetic biology and the specific domain of metabolic
engineering failing to live up to their promise and remaining anchored
predominantly as an academic exercise to showcase some elegant and rather
sophisticated reprogramming of microbial metabolism? We would
like to give a categorical negative reply to this, but the truth is 
probably less clear-cut. A lot of promising strains still fail to perform 
as expected when translated to realistic industrial fermentation 
conditions. Let’s look at some of the aspects which are currently 
bottlenecks to achieving the type of performance that is needed to 
make our processes economically viable and scalable to large-scale 
industrial fermentation applications. 

First of all, however, let us dispel the idea that GMO technol- 
ogies are a problem: our arsenal of therapeutic molecules is today 
predominantly from genetic engineering which has taken over 
largely from chemical synthesis as the therapeutic targets become 
increasingly large molecules. Likewise, the biofuels and substitu- 
tion logic for compounds used by the speciality chemicals industry, 
currently derived predominantly from non-sustainable petrochem- 
icals, has been driven largely by genetically engineered high- 
performance strains. Perhaps the food industry is still recalcitrant 
to GMO foods, but in many cases, we will be producing ingredients 
which suffer no such labeling restrictions as long as such products 
are DNA-free and demonstrated to be safe. One might point out
that virtually all the enzymes we employ in the food
industry [15] are being produced using high-performance chassis
organisms which have been engineered to attain enzyme
concentrations which can exceed 100 g/L and
quite regularly surpass 50 g/L. In addition to this technical enzyme
market, we can also see major innovation to boost non-animal dairy
products and other animal-derived protein sources, driven by animal
welfare and sustainability concerns. Of course therapeutic proteins
are exclusively produced using engineered chassis organisms
whether they be microbial, plant, or animal cell lines [16]. However,
it is probably in the domain of speciality ingredients that market 
opportunities are highest with a shift of consumers to biosourced 
supply chains. The cellular content of such metabolites in plants 
tends to be very low, and it is debatable today if land use should be 
dedicated to such crops when engineered microbes can produce 
concentrations several orders of magnitude higher than found in 
plants and reproduce exactly the same stereochemistry (often a 
major influence on the flavor and fragrance profiles or biological 
activity) in these nature-identical molecules [17]. 
Fig. 1 Relationship between market volumes and price of products


1.2 Economic Constraints and a Requirement for Underpinning Multidisciplinary Outlook


If you look at the cost structure of bioprocessing, there is a general 
inverse correlation between the market price and the market vol- 
ume (Fig. 1), but this also reflects cost structure in the production 
process. With bulk chemicals, volumes are high and market values 
are quite low, so efficient conversion is essential, and a real effort is 
needed to exploit the cheaper and abundant feedstocks. To be 
economically viable, high conversion efficiency close to the maxi- 
mum theoretical conversion yields is needed, and concentrations in 
the fermenter need to be high to facilitate cost-effective recovery. 
Energy and feedstock are probably the key costs, and this focuses 
the strain engineering on clearly identified phenomena controlling 
flux orientation into the desired synthetic pathway. 

When we look at high-price commodity chemicals, the situa- 
tion is rather different, and while these yield and concentration 
factors remain important, they are often no longer the dominant 
factor. The pathways are more complex and yields are often lower 
and the contribution of fermentation to the overall cost decreases 
with an increasing downstream processing cost. This is typified in 
some of the therapeutics in which the fermentation costs may be 
quite a low part of overall costs. The product has to meet purity 
criteria and formulation, and the downstream recovery is often the 
dominant cost. Because of this, the focus of the research is not so 
much to produce more but to produce cleaner such that separation 
costs can be diminished. It is often more important to remove a 
minor contaminant than to produce a few grams per liter more of 
the desired metabolite. 

In this context, using poorly defined complex but cheap feed- 
stocks, essential for bulk chemicals, would only have a marginal 
saving on the fermentation step but could have a disastrous conse- 
quence on downstream processing costs. This pleads for a global 
systemic view on the various limitations that have to be overcome 
before the strain engineering begins to fix the framework which has
to be respected in any strain that is generated [18] and can only 
exist when the strain engineering is planned in close collaborative 
effort with the process engineers who will have to translate this to 
an industrial logic. Anything else reduces the process engineering 
to damage limitation and slows down the transfer. In our view, this 
is one of the key difficulties which is encountered in many academic 
laboratories which tend to be focused either on the biology or on 
the process engineering. 

One of the consequences is that effort is spent in elaborate 
attempts to resolve via genetic engineering, phenomena which 
could be easily avoided by a better understanding of how the 
process could be designed to avoid the problem becoming mani- 
fest. This seems extremely logical, so why is this relatively rare in 
academic research structures while quite frequent in companies? 
This may simply reflect the way our academic departments have
been structured over many years with quite a deep divide between 
life science and engineering faculties, often situated quite some 
distance apart on the university campus and not always encouraged 
to work towards common goals. It also reflects what are considered 
to be academic success criteria as compared to what is vital for 
industry and the sequential nature in which biotechnology is 
often planned in which the strain development is usually close to 
completion before we hand this over with all its inherent strengths 
and weaknesses to the next link in the chain who then tries to make 
the best choices to optimize the strain performance in fermentation 
development before then looking at how we would recover the 
product. 

Since a lot of opportunities for innovation depend directly on
the preceding steps, the innovation space progressively diminishes 
and yet often needs extensive effort to find solutions compatible 
with the intrinsic weaknesses of the work that has gone beforehand. 
This pleads for a more open discussion to fix what are most proba- 
ble requirements downstream of the metabolic engineering and a 
reverse engineering logic in which the strains are designed from the 
onset to be compatible with the process most likely to be used. It is 
our view that adopting such a strategy would have a significant 
effect in accelerating the work as it moves up the technology readiness level (TRL) scale and
avoids “back-to-the-drawing-board” situations when strain perfor- 
mance collapses when faced with the process environment. While 
the “Design-Build-Test-Learn” cycle is inherent to synthetic biol- 
ogy, common sense tells us that you would like to limit to a strict 
minimum the number of revolutions within this cycle in order to 
shorten development timelines. This is the underpinning logic 
(Fig. 2) of a pan-European consortium of laboratories (IBISBA) 
that brings together different disciplines in some of the top labora- 
tories to make available a consolidated platform to accelerate the 
penetration of synthetic biology into industrial biotechnology 
applications [19]. 


[Figure 2 panel text: Pan-European Research Infrastructure. Classical approach: 10 to 15-year development. Integrative approach: 4-6-year development. Accelerating industrial biotechnology. Enabling the bioeconomy.]

Fig. 2 An integrated multidisciplinary logic to accelerate development timelines from synthetic biology to 
industrial biotechnology as proposed by the European IBISBA Network 


1.3 Optimizing Complex Branched Secondary Metabolite Pathways


Many of the interesting pathways that lead to high-value speciality 
chemicals are pathways that are intrinsically complex due to the 
molecular structure of the product and frequently involve quite 
promiscuous enzymatic activities which lead to families of mole- 
cules of similar structure but often very different biological activ- 
ities. Furthermore, their expression in natural hosts is often 
regulated by complex and often obscure trigger stimuli such that 
much of the natural biodiversity present in nature is often not 
expressed using classical growth conditions. Today transfer of 
genes into a pre-optimized chassis organism can unlock part of 
this silent biodiversity [20]. 

The enzymatic promiscuity is often seen in plant-based essen- 
tial oils which contain multiple compounds and contribute to the 
overall fragrance of the oil and are intrinsically coupled to the value 
of the oils. However, not all these compounds are safe when the oil 
is used in cosmetics, and this is a cause for concern at the moment 
with REACH risk approval necessary for applications beyond a 
certain quantity in Europe. Simply transferring the pathway as it is 
to a new host might well intensify the production capacity and lead 
to better sustainability, but it does not necessarily modify the 
mixture, and this needs to be addressed to remove any undesirable 
co-products. Fortunately, protein engineering can modulate the 
promiscuous nature of these enzymes and focus the pathway on the
desirable products, often changing the pathway towards the more
valuable products at the same time [21].

Expressing such secondary metabolite pathways does of course 
lead to increasing complexity in attaining high yields. The obliga- 
tory pathway flux optimization becomes increasingly difficult as the 
pathway involves more and more reactions, following a geometric 
progression in combinatory expression profiles. Early progress in 
metabolic engineering often involved primary metabolites which 
were closely linked to central metabolism in many cases such that 
modifying the network did not involve a significant protein burden 
and introduction of very few novel genes. It came with its own 
problems as yields were high, and so biomass synthesis was mod- 
ified due predominantly to the fact that carbon and energy fluxes 
were being deviated away from anabolism to non-growth-related 
end products. This could be dealt with quite effectively by decou- 
pling growth from metabolite accumulation phase using fed-batch- 
type strategies. 

These same strategies are often employed to induce the more 
complex, secondary metabolism-derived biosynthetic pathways, 
but consequences for cell fitness are accentuated by the overloading 
of the protein synthetic machinery. Indeed, the logic behind
growth decay during production is somewhat different as it is not
due to modified carbon flux throughout the metabolic network in
most cases but to the huge requirement for synthesis of high
concentrations of enzymes not required for growth. This is compounded
by the often intrinsically low enzymatic activity of these pathway 
enzymes which have evolved in natural evolution to produce rather 
small amounts under very precise environmental conditions. Cur- 
rently we use sledgehammer tactics to overexpress these activities to 
overcome such limitations and indeed attain performance levels 
much higher than in natural producers. With pathways with typi- 
cally 12-20 reactions at minimum, such overexpression monopo- 
lizes a significant part of the ribosomal protein synthesis machinery. 

The ribosomes can be considered like any catalytic reaction in 
biological systems in which multiple substrates (mRNA transcripts) 
with different affinities (RBS sequences) and ribosomal loading 
onto the mRNA determine the rate of protein synthesis and 
hence the intracellular concentration of each protein. All mRNA
species compete for access to ribosomes, and ultimately the global 
protein output reflects this open system. Sudden and massive 
induction of novel proteins which contribute a significant part of 
the overall mRNA population has inevitable consequences on the 
synthesis of cellular proteins and will provoke a diminished capacity 
of the cells to develop and multiply. 
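
The competition for ribosomes described above can be illustrated with a toy allocation model. The sketch below is not taken from the chapter: it assumes a fixed ribosome pool that is shared in proportion to mRNA abundance multiplied by a relative RBS affinity, and the function name `ribosome_shares` and all numbers are hypothetical.

```python
def ribosome_shares(transcripts):
    """Split a fixed ribosome pool among competing mRNA species.

    `transcripts` maps a name to (mRNA abundance, relative RBS affinity).
    Each species receives a share proportional to abundance * affinity,
    which is a simplifying assumption, not a measured relationship.
    """
    weights = {name: abundance * affinity
               for name, (abundance, affinity) in transcripts.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical numbers, chosen only for illustration.
before = ribosome_shares({"growth_proteins": (1000.0, 1.0)})
after = ribosome_shares({
    "growth_proteins": (1000.0, 1.0),
    "heterologous_pathway": (500.0, 2.0),  # strong RBS, heavy induction
})

print(f"growth share before induction: {before['growth_proteins']:.2f}")
print(f"growth share after induction:  {after['growth_proteins']:.2f}")
```

With these assumed values the induced pathway takes half of the ribosome pool, so the share left for growth-related proteins falls from 1.00 to 0.50.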
In other words, the expression of a non-growth-associated
pathway will automatically slow down the rate at which growth 
essential proteins can be synthesized. Of course this depends also 
on the mRNA decay rates which also contribute to the active 
cellular concentrations and which can now to some extent be 
engineered [22]. One might also consider genome streamlining 
as a means to remove some of the protein synthesis stress associated 
with this protein burden effect. If there are fewer proteins being
synthesized, then ribosomal efficiency for the essential proteins 
might be improved. Many proteins are present in cells as part of 
an adaptive response, and some of these have little value in a 
cultivation system in which specific growth conditions can be main- 
tained. Estimates on non-essential genes are very variable and 
depend on multiple criteria used to make the identification, often 
looking at it from a genome viewpoint when, from this particular 
application viewpoint, it would need to assess the usefulness of the 
genes actually transcribed under the chosen growth condition. We 
will come back to some of these aspects when discussing genetic 
stability later in the chapter, but what are our options to alleviate 
this phenomenon? 

Rather than treating the consequence of the loss of fitness, it
might be better to attack the cause of the problem: the naturally
low kcat values of these enzymes. There is tremendous scope for
redesigning these enzymes so that pathway flux can be maintained
without such drastic overexpression. The problem is not simple, as
natural evolution has favored low activity for many secondary
metabolite pathway reactions, and solving it will certainly call on
the automated exploration of synthetic biodiversity coupled to AI
technologies. However, gains in specific activity would remove
some of these protein burden situations and the associated metabolic
stress, and would be game-changers for the conversion efficiencies
of such speciality chemicals.

Beyond this rupture with the way we have generally been
looking at how to engineer organisms, we also have a very pragmatic
problem of how to optimize such pathways, as the combinatorial
options that need to be explored are huge if we do this reaction
by reaction. In this respect, prokaryotes with their operon-based
coding offer some advantages [23]. Let’s take a classical pathway
with, say, 14 enzymes involved. If you look at just four different
promoters for each reaction, you have 4^14 different constructs, or
2.7 × 10^8 strains, to construct and assess. If you segregate this
pathway into, say, 4 operons, you can explore the same experimental
space in a 4^4 matrix, so only 256 constructs [24]. Further
optimization can be achieved rapidly in such a modular logic by doing
intra-operonic balancing using defined RBS sequences in what has
come to be known as multidimensional heuristic optimization [25]
which can rapidly establish a heat map-type logic of the extent to 
which the pathway can be optimized (Fig. 3) without having to 
have biofoundry-type facilities available. 
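
The arithmetic behind this comparison is easy to reproduce. The short sketch below simply counts constructs under the stated assumption that every unit (a reaction or an operon) independently receives one of four promoters; the function name `design_space` is introduced here for illustration only.

```python
def design_space(levels: int, units: int) -> int:
    """Number of distinct constructs when each of `units` independently
    takes one of `levels` expression levels (here, promoters)."""
    return levels ** units


per_reaction = design_space(4, 14)  # one promoter choice per enzyme
per_operon = design_space(4, 4)     # one promoter choice per operon

print(f"reaction-by-reaction: {per_reaction:,} constructs")  # 268,435,456
print(f"operon-by-operon:     {per_operon:,} constructs")    # 256
print(f"reduction factor:     {per_reaction // per_operon:,}")  # 1,048,576
```

Grouping the 14 reactions into 4 operons therefore reduces the number of promoter combinations by a factor of 4^10, at the price of fixing the relative expression within each operon until intra-operonic balancing is carried out.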
Fig. 3 Multidimensional optimization logic using a modular approach (a) and a typical pathway flux heat
diagram showing best option construction space (b) 


As pathway flux gets closer to a biochemical network maxi- 
mum, or co-factor availability becomes a bottleneck, optimization 
needs to be extended to include updates to the central metabolic 
pathways, but this is a complex challenge due to the highly regu- 
lated metabolic topology of this network. Quite considerable prog- 
ress can still be made in the speciality chemicals domain before 
having to engage in this additional level of complexity. However,
these molecules are often challenging because they are associated
with a certain toxicity for the producer strain. We may need to
concentrate more on engineering efflux systems to remove these
compounds efficiently from the cells, and then employ multi-phasic
fermentation technologies to remove the compound directly from the
aqueous phase, thereby overcoming these toxic effects, which otherwise
accentuate the inevitable loss of fitness in such engineered cells.

So far we have been considering optimizing what are predomi- 
nantly natural pathways re-engineered to boost performance but 
still suffering from the overriding constraint in many biological 
systems. Overall, metabolic networks have evolved as a system
favoring growth, first and foremost rapid growth, but able
to switch to efficient growth if substrate limitations occur. One of
the guiding principles in natural evolution is therefore to enable 
pathways to be regulated so as to maintain optimal homeostasis 
which in turn ensures best metabolic efficiency relating to growth 
and fitness. Of course, when looking at biotechnology applications, 
the first thing we have to do is bypass the complex regulatory 
phenomena that maintain a balanced supply of all anabolic precur- 
sors while often we retain the basic reaction sequence that nature 
has evolved. 

If we shift into a different logic, we might ask if an organic 
chemist would derive the same reaction sequence and whether their 
pathway would be compatible with a specific growth regime that we 
as biotechnologists ought to be able to control so as to modify the 
fitness criteria and enable those pathways with best thermodynamic 
efficiency to be proposed rather than those in which the complexity 
has been selected to enable better control of a constrained
requirement for that pathway [26] and more generally to maintain
growth-related homeostasis. This would often require engineering
of the enzymes to adapt them to novel reactions and assembling
such reactions in a novel manner, but it opens up new possibilities
to overcome key pathway limitations which would be otherwise
difficult to resolve.


1.4 Genetic Stability


As the metabolic engineering challenge increases to redesign and 
express more complex pathways, the question of genetic stability 
will become increasingly important, together with more detailed
knowledge of how the choice of genomic integration site can
modulate the efficiency of gene expression. As we
have seen, engineering new functions into cells creates a loss of 
fitness, accentuated by the reinforced process-induced stress asso- 
ciated with poorly mixed large-scale fermenters and what is termed 
metabolic burden. 

The instability of production strains is the result of several 
environmental constraints intrinsic to the production process but 
also those intrinsic to the organism (metabolic cost, toxicity, DNA 
repair mechanisms, etc.). One of the consequences of this is a
tendency for cells to try to remove the cause of this burden: the
novel genetic elements introduced into the genome. The various
factors involved are detailed in a recent paper [27]. In many cases,
we are still reliant on multi-copy plasmid technology, with its
intrinsic plasmid loss, and have used either antibiotic resistance
(unacceptable in industrial fermentations for many applications) or
auxotrophic complementation to limit that loss. The problem
concerning most of these systems is that while they do offer some 
protection as long as the selective pressure is maintained, the full 
force of this protection is concentrated on the loss of the last copy, 
so expression profiles can change throughout the production 
phase. This becomes more acute when industrialization seeks
to prolong the period of production and most acute when shifting
to continuous or semi-continuous production modes.

The classical response to this lack of plasmid stability is to 
integrate the novel pathway into the genome, and when envisaging 
this strategy, the question that begs to be answered is “can I favor 
genetic stability and best expression of the genes introduced by 
choosing where in the genome I make the integration?” Today one
of the challenges which interfaces metabolic engineering with systems
biology is to better understand how genome fine structure can
be exploited to limit genetic instability. Of course, genetic instability
often leads to yield loss but also makes production unpredictable,
with knock-on effects for product recovery.

Quite often a production organism has multiple copies of the
required genes present in the genome, and unless these are designed
to reduce homology using the full scope of alternative codons,
recombination is always likely to occur. Likewise, the presence of
transposable elements, frequent in many microbial genomes, will tend
to facilitate genome editing as a survival response to stress conditions.
Removing such sequences from the production host has been
shown to increase genome stability in strains of Escherichia coli
engineered for 1,4-butanediol production [28], and no doubt this
approach needs to be extended to all chassis organisms envisaged
for industrial production. In a wider logic, synthetic biology projects
to create synthetic chromosomes can remove all such mechanisms
favoring genetic instability, as has been demonstrated for
the yeast genome project in which transposon elements were
removed [29]. Maintaining introduced pathways has a metabolic
cost to the cell and will inevitably lead to attempts to remove such
genes. Strain monitoring, preventive measures to attenuate this
probability, and fermentation conditions which uncouple production
from growth would certainly help limit the consequences, but can we
construct our strains in such a way that the best possible stability
can be ensured?


1.5 The Concept of Genomic Safe Harbors


The problem of transgene instability is a well-known phenomenon 
in the pharmaceutical industry, where production cell lines are 
mainly generated by random transgene integration. As a result, 
these cell lines often need to be discarded during the cultivation 
phase due to a progressive loss of productivity, which may be the 
consequence of chromosomal rearrangements and/or of transcrip- 
tional repression by methylation. To circumvent these issues, sev- 
eral groups have pointed out the necessity of integrating transgenes 
at specific loci, which should allow safe and stable expression over 
time. The notion of “safe harbor” appeared at the very beginning of 
this century with the first gene therapy projects aiming to introduce 
a copy of a functional gene into the cells of patients with a 
defective gene. 

The success of gene therapy requires stable expression of the
introduced gene without deleterious impact on the organism.
Thus, the introduction of the gene in these safe loci is the sine qua
non of a gene therapy guaranteeing maximum safety and efficacy.
These integration sites, called genomic safe harbors (GSHs), are
defined as chromosomal locations where a transgene can integrate
and function in a predictable manner without disrupting the activity
of endogenous genes or altering the viability of the organism
[30]. Most GSHs were identified after random integration of a
lentivirus carrying either a promoter-less reporter gene (encoding
antibiotic resistance, green fluorescent protein, or β-galactosidase)
for promoter trapping or a full expression system, followed by
phenotypic screening. These
approaches were successfully used in human cells [31, 32], embry- 
onic mouse cells [33], and CHO cells [34] to identify sites for 
expression of these transgenes. 
Furthermore, the lack of proven pathology after modification
at these loci has subsequently led to considerable interest for gene
therapy, though their location in gene-rich regions may pose a
risk of deregulation of adjacent genes [35]. As such, they do not fit
the selection criteria initially proposed, i.e., (i) be at least 50 kb
from the 5’ end of a gene, (ii) be at least 300 kb from oncogenes,
(iii) be at least 300 kb from genes coding for microRNAs, (iv) be
outside a transcription unit of a gene, and (v) be outside
ultra-conserved regions. These criteria are intended to limit the risk of
disruption of endogenous genes as well as long-distance interactions
between vector-encoded transcriptional activators and adjacent
genes [30]. Research continues to find the best safe harbor
sites in various genomes, but this very targeted logic with specific
objectives is progressively becoming a gold standard for all metabolic
engineering projects, which need not only to optimize
the stability of the genetic information added to the production
host but also to minimize direct consequences of modifying the
expression of adjacent genes.


1.6 Identification of Integration Loci in Microbial Cell Factories


The identification of integration loci in microorganisms, especially
yeasts, emerged in the 2010s with the rise of synthetic biology. As
we have seen, generating production organisms able to synthesize
complex molecules requires the addition of quite large numbers of
genes, and there is a real requirement to ensure stable and predictable
transgene expression without affecting neighboring genes. One of the
first studies involved introducing the LacZ gene into 20 different
integration sites in the Saccharomyces cerevisiae genome and measuring
β-galactosidase activity [36]. The study revealed up to eightfold
differences in activity depending on the integration locus and showed that
regions near telomeres were less favorable for expression and that
regions near replicating sequences were more favorable. Later work
validated 11 individual integration loci located in the intergenic
regions of S. cerevisiae chromosomes X, XI, and XII for their ability
to ensure high transgene expression [37]. In addition, the sites were
separated by essential genes, which prevents loss of integrated fragments
through recombination and ensures the stability of the strain.
Finally, these sites were all located in intergenic regions of at least
750 bp to reduce the impact on neighboring genes. Once such loci
had been validated, their full potential for metabolic engineering was
illustrated by introducing the seven-step indolylglucosinolate pathway
of Arabidopsis thaliana, a multigene pathway with up to 22 genes.


1.7 Genome Editing, a Powerful Tool to Target Desired Sites


The development of cheap and easy-to-use genome editing tech- 
nologies has increased our ability to manipulate the genetic makeup 
of cells and microbes. This technology allows (i) the insertion of 
DNA fragments into targeted genomic locations; (ii) the deletion 
of small and large DNA fragments; or (iii) the introduction of point 
mutations. The development of genome editing kits including 
nucleases, guide RNAs, and recombination templates has greatly 
accelerated the identification and validation of integration loci in
S. cerevisiae but also in other yeasts where homologous recombination
is less efficient. Thus, Nielsen’s group developed the
EasyClone-MarkerFree vector toolkit [38], allowing stable integration
in both laboratory and industrial strains of S. cerevisiae and
high gene expression at these 11 individual sites [39, 40]. More
recently, the same group developed an expansion of the
EasyClone-MarkerFree toolkit for the S. cerevisiae genome with eight new
integration sites [41]. The challenge now is to deploy these genetic tools in
polyploid industrial strains. Recent papers have demonstrated the
power of the HI-CRISPR genome editing tools to disrupt four
genes in diploid and triploid yeast [42]. The next step will be to
introduce multiple copies of metabolic pathways in polyploid
industrial strains.

In Pichia pastoris or Kluyveromyces species, which, unlike 
S. cerevisiae, have a low frequency of spontaneous homologous 
recombination, the introduction of transgenes at a predefined
locus is possible using the CRISPR/Cas9 system and is clearly favored
when key genes involved in NHEJ are deleted [43-45], which
has allowed the validation of integration loci. Furthermore, there is a
strong interest in finding safe harbors in non-conventional organisms
of biotechnological interest such as the microalga Phaeodactylum
tricornutum [46] and the oleaginous yeast Yarrowia
lipolytica [47].


Once safe harbor loci are validated, it is easy to create “landing 
pads,” i.e., sites in which transgenes can be routinely inserted for 
stable and reliable expression with predictable homologous
recombination frequency [34]. These landing pads generally contain a
recombination site and a selection marker. They are very useful
tools for creating multi-copy site-specific integration platforms. Thus,
multi-copy integration (18 and 25 copies per genome) of the 2,3-butanediol
biosynthesis pathway has been reported [48, 49]. Recently, several studies have
reported the development of artificial chromosomes as a tool to 
easily and efficiently assemble the genes and chromosomal elements 
necessary to control the expression of metabolic pathways. These 
systems have the advantage of circumventing the unpredictable 
impact of the chromosomal environment (chromatin accessibility, 
methylation, microsatellite sequences, etc.) on the site-specific inte- 
gration frequency. 

For example, Yarrowia lipolytica artificial chromosomes (ylAC)
were used to assemble and express a large metabolic pathway 
including three key genes for xylose utilization (XYL1, XYL2, and 
XKS1) and three for cellobiose consumption (CBP1, CDT1, and 
scPGM2) [50]. In addition, a study showed the development of a 
supernumerary neochromosome for rational engineering of the 
yeast genome. The ability of synthetic supernumerary chromo- 
somes to serve as a landing platform for the integration of native 
and heterologous metabolic pathways has been demonstrated.
However, it was noted that neochromosome expression of an
essential pathway reduced the host's specific growth rate by
approximately 14–24% [51]. These new genetic tools are useful to express a
large number of metabolic pathways; demonstrating the stability of
these systems in bioreactors will allow them to be validated for
industrial applications.

Identification of safe harbors has been a long and tedious
process but is absolutely necessary to ensure constant and predictable
production of target molecules. It is mostly a trial-and-error
approach performed in four steps: random integration of a
transgene, identification of the integration sites, evaluation of the
expression level, and then evaluation of the consequences at the cell
level (transcriptome, growth, etc.). Recently, bioinformatics pipelines
have emerged based on a rational approach whose criteria
meet the established GSH criteria [52]. Although these
pipelines are currently developed for mammalian cells, it is likely
that they will facilitate similar toolbox development for use in
microbial cell factories of biotechnological interest (yeast, bacteria,
algae, etc.) with rational criteria based on gene density, chromatin
accessibility, presence of transposable elements, etc.


1.9 Creating Fermentation-Friendly Chassis Organisms


One of the common problems encountered in translating 
promising newly engineered microbes with great performance 
under laboratory conditions is that they sometimes struggle to 
maintain the same performance when scaled to full industrial pro- 
duction. This reflects the additional metabolic stress (see above) 
which inevitably occurs when laboratory-scale fermentation is 
shifted to full industrial-scale fermenters which often function 
close to the absolute limits of mass transfer and inevitably provoke 
some degree of environmental heterogeneity. This is frequently 
overlooked during the strain development strategy and is inherent 
to the sequential pattern of process evolution in which the biopro- 
cess engineering aspects are not usually examined until the first 
generation of high-performance hosts is ready. While it would 
make a lot more sense to get the process engineering constraints 
identified and included in the initial project plan from the onset, 
this is not common, at least in academic projects. 

Knowing what can easily be achieved by processing technolo- 
gies and what constraints this might impose on the microbe is 
essential information which could be used to plan how best to 
combine the advantages that strain engineering can bring with 
those solutions which can be resolved by the fermentation and 
downstream stages of the process. It is also vital to take into account 
the physical limitations which are associated with scaling the
process. As fermentation volumes increase, mixing and mass
transfer limitations appear due to the geometry of the reactor. As
most production cycles are based on a fed-batch logic with high
biomass concentrations to attain the volumetric productivity
needed, the difficulty of transferring this type of performance 
from classical laboratory apparatus which normally have good 
transfer dynamics to a larger scale fermenter is a real problem. 

We tend to engineer high-performance strains that require
very tight control over the conditions provided in the fermenter,
and such control is difficult to attain at large scale. More effort
is needed to design or evolve chassis organisms that show a robust
phenotype when subjected to the inevitable transient stress
conditions arising from a lack of reactor homogeneity, in which
spatiotemporal variations occur in key factors such as nutrient
availability, pH, dissolved oxygen, transient co-product
accumulation, etc. These variations impose low-intensity but
high-frequency stress on the microbes, and efforts to improve
phenotypic robustness would facilitate scaling of the actual
production organisms. Today, computational fluid dynamics can model
the mixing characteristics of any fermenter and increasingly
incorporates biological kinetics [53]. Such models not only predict
the actual distribution of conditions seen throughout a fermentation,
indicating where strain engineering might have to address fitness
characteristics, but also help design small-scale reactors which
approximate the constraints found in full-scale fermentations. These
constraints can easily erode the intensified flux patterns seen at
lab scale and significantly reduce yields.

Bioreactor design is notoriously conservative and has not
evolved to any great extent since the initial investments in
industrial biotechnology. In many cases it is unlikely to change
radically, though clearly there is scope to improve mixing within
large-scale reactors and help offset this dilemma. One major shift,
mirroring the way batch chemical synthesis is beginning to move
towards continuous flow chemistry, could help simplify the challenges
of industrial biotechnology. A relatively small number of
processes have been developed with a continuous culture system,
which stabilizes a specific pseudo-stationary production environment.
Such an environment would greatly facilitate the optimization of
strains adapted to these stable conditions and able to tune their
performance to this constant environment. Currently our carefully
designed microbial factories, exploited in batch or fed-batch
conditions, have to come to terms with an environment in which the
principal constraints change dynamically, and indeed
much of the production phase will not be under conditions in
which pathway flux has been optimized. Fixing a stable environment
means that you can optimize for a given condition, though
obviously the genetic stability issues discussed above become even 
more important. This cultivation method has shown itself to be a 
key factor in understanding microbial physiology and would be a 
stable basis for any manipulation of that same physiology. A sim- 
plified system is always simpler to engineer. 
2 Conclusions


Metabolic engineering has an increasingly sophisticated toolbox
available, and many of the problems that were difficult to resolve
20 or more years ago are becoming tractable through consolidated
strategies which can identify upfront where the big challenges are going
to be. Better integration of the entire value chain from the onset
would speed up the transfer of promising innovation into validated
processes and avoid some of the bottlenecks that have slowed down
industrial exploitation of some aspects of microbial cell factories to
date. Making greater use of mathematical modeling and the in silico
testbed, so as to focus the experimental input on the most probable
solutions, is the essential glue that can fuse the
efforts in systems biology and the application potential of metabolic
engineering. Better understanding of how microbes function in
realistic industrial conditions will enable robustness to be built into the
DNA of our high-performance cell factories and create a rapid
development pipeline to boost success rates when transferring
laboratory studies to industry. Today the framework exists and needs
to be more widely employed.
Abstract


This chapter outlines the myriad applications of machine learning (ML) in synthetic biology, specifically in 
engineering cell and protein activity, and metabolic pathways. Though by no means comprehensive, the 
chapter highlights several prominent computational tools applied in the field and their potential use cases. 
The examples detailed reinforce how ML algorithms can enhance synthetic biology research by providing 
data-driven insights into the behavior of living systems, even without detailed knowledge of their underly- 
ing mechanisms. By doing so, ML promises to increase the efficiency of research projects by modeling 
hypotheses in silico that can then be tested through experiments. While challenges related to training 
dataset generation and computational costs remain, ongoing improvements in ML tools are paving the way 
for smarter and more streamlined synthetic biology workflows that can be readily employed to address 
grand challenges across manufacturing, medicine, engineering, agriculture, and beyond. 


Key words Machine learning, Synthetic biology, Protein engineering, Metabolic engineering 


1. Introduction 


The era of synthetic biology started long before the term was first 
used by Barbara Hobom in 1980 to describe microbes that were 
genetically modified using recombinant DNA technology
[1]. Fueled by the molecular biology revolution that took place in 
the mid-twentieth century, scientists set their sights on achieving 
the ability to precisely engineer microorganisms—and by the early 
1990s, their dream was finally realized. Since then, advancements in 
sequencing and genomics have paved the way for the evolution of a 
discipline which aims to create, control, and program cellular 
behavior and metabolism [2]. 

At the turn of the millennium, “synthetic biology” referred to 
the synthesis of unnatural compounds that function in living
systems. In exploring and taking interchangeable parts from living
systems (biology) and assembling them unnaturally (synthetic) to
create devices that resembled living systems, synthetic biologists
attempted to recreate emergent properties of living systems, including
inheritance and evolution [3, 4]. During the field's foundational years,
synthetic biologists created simple circuits for gene regulation and
tested them on Escherichia coli, a classic molecular biology workhorse
chosen for its ease of manipulation and our extensive knowledge
of its genetics and genomics [5].

Fig. 1 A brief timeline of the breakthroughs in the field of machine learning and synthetic biology from the
1960s onward

By the mid-2000s, synthetic biology expanded dramatically in 
its endeavors to create and construct biological systems, with a 
long-term goal of engineering whole genomes [6]. But as assem- 
bling interchangeable parts and designing circuits became more 
complex and ambitious, synthetic biologists had to grapple with 
the disproportionate time required to design systems facilitating 
the proper function of the synthesized circuits (Fig. 1) [7]. 

Indeed, while synthetic biology has already delivered novel 
solutions to long-standing challenges in the global healthcare, 
agriculture, manufacturing, and environmental sectors, there is a 
lingering perception that the field has yet to live up to its full 
potential. The rise of artificial intelligence, robotics, and automation,
however, may help synthetic biology overcome this perception
[8, 9]. By reducing the time and cost associated with designing
intricate biological systems, such frontier technologies stand to
improve the field’s return on investment [9].

Over the years, computational methods have become an essen- 
tial part of biology, mirroring the increase of digitalization across 
practically all sectors. The continuous development of sophisticated
algorithms has ushered in new and improved tools in statistics,
simulation, and data management—all of which are reshaping the 
way biological studies are performed [10]. Computational biology, 
with its large datasets streamlined by databases and statistical ana- 
lyses, provides a reference map for the field of biology [11]. 
With its roots in pattern recognition, statistics, and data opti-
mization, machine learning (ML) is a particular artificial intelli- 
gence method commonly used in computational biology. For 
instance, ML is typically applied by systems biologists to optimize 
fermentation conditions and product biosynthesis routes—reduc- 
ing the need for laborious benchwork and allowing for higher 
chances of experimental success (Wu et al. 2016b). 

One major objective of ML methodologies is to generate pre- 
dictive models based on an underlying algorithm and a given 
dataset containing features and labels across various samples. A 
typical ML workflow starts by inputting data, which is then pro- 
cessed with a set of mathematical formulas and statistical assump- 
tions. This process, called training, pinpoints the optimal 
configuration of model parameters with the aim of translating 
features into an accurate prediction of labels based on the given 
dataset. After identifying the optimized parameters, a new dataset 
can be used to generate output. A model that can accurately predict 
the training data and independent datasets is deemed to have 
properly “learned.” Models that can accurately predict the training 
dataset but not the independent ones are called “overfit” models, 
while those that can neither predict the training dataset nor gener- 
alize to new data are dubbed “underfit” models—both of which are 
major causes for poor performance in ML approaches. Overfitting 
and underfitting models can be resolved by decreasing
or increasing the complexity of the model used for learning, respec- 
tively [12, 13]. 
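
The relation between model complexity, overfitting, and underfitting can be made concrete with a small experiment. The sketch below is an illustration under stated assumptions rather than a method from the cited works: it uses a k-nearest-neighbor regressor on synthetic data (a noisy sine curve), where a small k gives a highly flexible model and a large k a very rigid one. The function names, the noise level, and the train/test split are all hypothetical choices.

```python
import math
import random


def knn_predict(train, x, k):
    """Predict y at x as the mean label of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, data, k):
    """Mean squared error of the k-NN model (fitted on `train`) over `data`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


# Synthetic dataset: features x, labels y = sin(x) plus Gaussian noise.
rng = random.Random(0)
points = [(i / 10, math.sin(i / 10) + rng.gauss(0, 0.2)) for i in range(40)]
train, test = points[0::2], points[1::2]  # alternate points: train / independent set

for k in (1, 3, len(train)):
    print(f"k={k:2d}  train MSE={mse(train, train, k):.3f}  "
          f"test MSE={mse(train, test, k):.3f}")
```

With k = 1 the model reproduces the training labels exactly, so its training error is zero while its error on the independent set is not, which is the signature of overfitting. With k equal to the size of the training set the model predicts the same mean value everywhere and performs poorly on both sets, which is underfitting. An intermediate k lowers the complexity relative to the first case and raises it relative to the second.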

Meanwhile, ML methods fall under two overarching cate- 
gories: supervised and unsupervised learning. Unsupervised meth- 
ods like principal component analysis and hierarchical clustering use 
patterns in the features of the input data to produce visualizations 
that help discriminate changes in groups. In contrast, supervised
learning is used when labels for the input data are already known,
so that a model can learn the patterns that distinguish them [14].
For example, if the microbiomes of
healthy and diseased individuals are available, supervised ML can
help predict whether a sample from another individual belongs
to the healthy or the diseased group. Both categories can be
implemented with deep learning, a specialized subset of ML that
involves neural networks, algorithms inspired by the human brain.
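
The microbiome example can be sketched as a minimal supervised classifier. The code below assumes, purely for illustration, that each sample is summarized by the relative abundance of two hypothetical taxa and that a nearest-centroid rule is adequate; the data values and function names are invented for this sketch, and real microbiome studies involve far more features and more careful validation.

```python
def centroid(rows):
    """Feature-wise mean of a list of samples."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def classify(sample, centroids):
    """Return the label whose centroid is closest to the sample (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(sample, centroids[label]))


# Labeled training data: [abundance of taxon A, abundance of taxon B] (hypothetical).
training = {
    "healthy": [[0.60, 0.10], [0.55, 0.15], [0.65, 0.12]],
    "diseased": [[0.20, 0.50], [0.25, 0.45], [0.15, 0.55]],
}

# "Training" here consists of computing one centroid per known label.
centroids = {label: centroid(rows) for label, rows in training.items()}

# Predict the group of a sample from a new individual.
print(classify([0.58, 0.14], centroids))  # healthy
print(classify([0.22, 0.48], centroids))  # diseased
```

The labels ("healthy", "diseased") are what make this supervised: an unsupervised method such as hierarchical clustering would receive only the abundance vectors and would have to discover the two groups on its own.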

Accordingly, ML strategies can be applied to identify funda- 
mental design principles in synthetic biology—particularly in creat- 
ing components with enhanced novel functions, which, in turn, 
diversify molecular parts that are available for efforts in the field 
[12]. In this introductory chapter, we will discuss the integration of 
ML in synthetic biology, with an emphasis on the subfields of cell 
and metabolic engineering, as well as how such integration can 
enable synthetic biology to meet current challenges in uncovering 
the complexities of biological systems. 
2 Computational Tools for Synthetic Biology Applications


2.1 Machine Learning for Cell and Protein Engineering 


In synthetic biology, the subfield of cell engineering involves the 
assembly of biological components to form gene circuits or net- 
works that can work together with the internal cell machinery to 
restore, improve, or add novel functions to a chosen host cell 
[15]. These biological components often include elements that 
regulate the transcription and translation of proteins, as well as 
transcription factors that can be used to regulate the activity of 
other proteins. 

To design cells that behave in a predictable and reproducible 
way, synthetic biologists have sought to characterize the individual 
performance of known biological components, understand their 
fundamental mechanisms of action, and test the interactions of 
these components within the host cell mostly via trial-and-error 
experimental approaches [16]. While cell engineering techniques 
have become more sophisticated, there are still some hurdles faced 
by synthetic biologists. Given the limited understanding of design 
rules, designing novel biological components and identifying the 
interactions between host cell machinery and engineered compo- 
nents can be a challenge, posing troubleshooting difficulties. To 
this end, ML offers a path for optimally designing and fine-tuning 
biological components with predictable outcomes in the host cell. 
This includes applications in the optimization of gene expression, 
alteration of cellular function, and design of proteins (Fig. 2). 

Tuning gene expression in cells typically involves modifying 
and screening promoter and ribosome binding site (RBS) 
sequences [17] for transcriptional and translational regulation 
through experiments and computational predictive tools. How- 
ever, the latter often requires a comprehensive understanding of 
the regulatory mechanisms controlling gene expression (Choi et al. 
2019). While such tools can be effective, they may not be as useful 
when information is incomplete, especially in the case of 
non-model organisms. 

Having recognized ML’s potential in cell engineering, synthetic 
biologists have reduced their reliance on wet lab optimization 
experiments, turning instead to screening and designing biological 
components in silico. Several research groups have reported using 
neural networks, a popular ML algorithm, to guide the data-driven 
design of promoters [18-20] and RBS sequences [21] for 
controlling gene expression. For instance, Meng et al. have 
deployed neural networks to predict promoter strength with 
mutated promoters and RBS sequences as inputs [22]. Notably, 
their algorithm surpassed even mechanistic models based on 
position weight matrix and thermodynamics methods [23-25]. 


[Figure 2 panel labels recovered from the image: Optimizing search and design of proteins; Search and annotate protein-encoding genes; Predict enzyme function; Guide rational protein design] 

Fig. 2 Applications of machine learning for cell engineering. Machine learning can be used in three categories: 
optimizing gene expression, optimizing tools for altering cell function, and optimizing the search and 
design of proteins. (Created with Biorender.com) 
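
Sequence-based models such as the promoter and RBS networks discussed above need the DNA sequence in numeric form. One common choice is one-hot encoding, sketched here under the assumption of a plain A/C/G/T alphabet. The function names and the example hexamer are illustrative and do not come from the cited studies.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA sequence as one row of four 0/1 values per base."""
    rows = []
    for base in seq.upper():
        if base not in BASES:
            raise ValueError(f"unexpected base: {base!r}")
        rows.append([1 if base == b else 0 for b in BASES])
    return rows

def flatten(rows):
    """Concatenate the per-base rows into a single feature vector."""
    return [value for row in rows for value in row]

promoter = "TTGACA"  # illustrative hexamer, not data from the chapter
features = flatten(one_hot(promoter))
print(len(features))
```

The resulting vector has four entries per base, so a model sees each position and each nucleotide as a separate input feature.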


Aside from predicting the gene expression of biological components 
like promoters or RBS from their sequences, other tools 
can also look at factors affecting gene expression. One example is 
SelProm, an open-source plasmid selection tool that houses a data- 
base for plasmid expression strength and a prediction tool based on 
partial least-squares regression [26]. SelProm can compare and 
identify inducible promoter expression systems similar to constitu- 
tive expression systems across various conditions including strain, 
media, inducer concentration, induction time point, and plasmid 
backbone. 

Beyond promoters and RBS sequences, ML can also predict 
gene expression by optimizing the biological components that take 
part in transcription and translation. For example, Tunney et al. 
used a feedforward neural network model in which information is 
constantly “fed forward” from one layer to the next—mirroring 
biological processes—to predict ribosome distribution along 
mRNA transcripts and translational elongation speeds from the 
coding sequence of mRNA transcripts [27]. Another study 
reported the use of a deep learning technique known as a convolu- 
tional neural network (CNN) to predict protein expression in 
Saccharomyces cerevisiae from the 5’ untranslated region (UTR) 
sequences of mRNAs [28]. The described model generated more 
active 5’ UTRs, leading to higher protein translation and 
expression rates. Likewise, transcriptional regulation can also be 
predicted with ML. A tool called DeepTFactor uses a deep neural 
network (DNN) to predict transcription factors of both eukaryotic 
and prokaryotic origin [29]. Promisingly, DeepTFactor can be 
used to understand the transcriptional regulatory systems of an 
organism of interest for cell engineering applications. 

Synthetic biologists have also worked towards controlling gene 
expression by developing RNA-mediated genetic switches 
[30]. One such genetic switch is the riboswitch, an aptamer- 
containing mRNA molecule that can recognize and bind to specific 
ligands and, in turn, control gene expression through a conforma- 
tional change [31]. In 2019, Groher et al. reported combining a 
CNN with a classification algorithm called random forest analysis to 
develop a prediction model that accounted for the aptamer 
sequence’s biophysical properties, including entropy, stem melting 
temperature, and GC content. The model was then used to 
improve the dynamic range of a tandem tetracycline-dependent 
riboswitch (Groher et al. 2019). Another kind of genetic switch is 
the toehold switch, which consists of an RNA hairpin placed at the 
5’ end of an mRNA molecule, allowing translation to occur when 
triggered [31]. While riboswitches control gene expression 
through conformational changes, toehold switches can be distin- 
guished as they control gene expression via base pairing with target 
RNA sequences. In 2020, two studies described predicting toehold 
switch function through DNNs [32] combined with Sequence- 
based Toehold Optimization and Redesign Model (STORM) and 
Nucleic-Acid Speech (NuSpeak) [33]. Taken together, these 
predictive tools enabled by neural networks will greatly help in 
developing more robust and sensitive biological circuit components for 
molecular detection, biosensing, and precision diagnostics. 
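
Models of this kind take sequence-derived properties as inputs. The sketch below computes two simple ones for an RNA sequence. It is only an illustration: the Wallace rule of thumb for short oligonucleotides stands in for a proper stem melting temperature calculation, and neither the function names nor the example sequence come from Groher et al.

```python
def gc_content(seq):
    """Fraction of G and C bases in a nucleic acid sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def wallace_tm(seq):
    """Rough melting temperature in deg C: 2 * (A + T/U) + 4 * (G + C)."""
    seq = seq.upper().replace("U", "T")
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc

def feature_vector(seq):
    """Length, GC content and approximate Tm, in that order."""
    return [len(seq), gc_content(seq), wallace_tm(seq)]

print(feature_vector("GGCAUACCAGCC"))  # hypothetical stem sequence
```

Feature vectors like this one could be passed to a classifier such as a random forest alongside learned sequence representations.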

On top of designing biological components to regulate gene 
expression, there is also a need to design more efficient tools for 
altering cell function. This can be achieved by using genome editing 
tools, such as the CRISPR-Cas system, to remove unwanted genes 
or permanently incorporate foreign biological components into the 
cell genome. While these tools have revolutionized the field of 
synthetic biology, there are still opportunities to improve 
CRISPR-Cas tools in terms of predicting and enhancing sgRNA 
binding to the desired target site as well as minimizing off-target 
binding. 

Earlier studies used the support vector machine model, a type 
of supervised ML, to enhance CRISPR-Cas9 activity [34, 35] but 
were limited by the small size and low quality of training data. 
However, a combination of higher-throughput screening methods 
and deep learning has improved the accuracy of the newer sgRNA 
activity prediction tools. One example is the DeepCpf1 tool, which 
uses DNNs trained on large-scale sgRNA (AsCpf1: Cpf1 from 
Acidaminococcus sp. BV3L6) activity datasets to predict on-target 
knockout efficacy (indel frequencies) [36]. 

Unlike previous studies that trained on medium-scale datasets, 
Kim et al. developed a high-throughput experimental approach 
which generated a large dataset of over 15,000 target sequence 
compositions and their corresponding indel frequencies suitable 
for applying deep learning approaches. Besides predicting 
on-target knockout efficacies, forecasting off-target sgRNA activity 
using regression models and DNNs helps prevent undesirable edits 
that may result in genomic instability or the functional disruption 
of normal genes [37, 38]. To both maximize on-target efficacy 
(high sensitivity) and minimize off-target effects (high 
specificity), Chuai et al. developed the tool DeepCRISPR, which 
uses both unsupervised deep 
representation learning and DNNs. Indeed, DeepCRISPR man- 
aged to surpass classic ML methods and could be generalized to 
other cell types (Chuai et al. 2018). 
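
For contrast with the learned predictors above, the simplest possible off-target screen just counts mismatches between a guide and every same-length window of a target sequence. This sketch is a naive baseline and not the method of any cited tool. It ignores PAM requirements, the reverse strand and position-dependent mismatch effects, and all names and sequences are made up.

```python
def mismatches(guide, site):
    """Number of positions at which a guide and a candidate site differ."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have equal length")
    return sum(a != b for a, b in zip(guide.upper(), site.upper()))

def candidate_off_targets(guide, genome, max_mismatches=3):
    """Slide the guide along one strand and report near-matching sites."""
    n = len(guide)
    hits = []
    for start in range(len(genome) - n + 1):
        d = mismatches(guide, genome[start:start + n])
        if 0 < d <= max_mismatches:   # d == 0 is the intended target
            hits.append((start, d))
    return hits

# Toy example with a 5-nt "guide" (real guides are much longer).
print(candidate_off_targets("GACGT", "TTGACGTAAGACCTGG", max_mismatches=1))
```

Learned models aim to improve on such counts by weighting sequence context and experimental data rather than treating every mismatch equally.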

ML can also be applied in cell engineering to search and anno- 
tate protein-encoding genes within the genome. This is particularly 
useful for designing metabolic pathways and constructing them in 
production host cells [17]. Traditionally, the hidden Markov model 
is used for this purpose [39, 40]. In this method, genes are first 
identified in the genome via protein-coding signatures like the 
Shine-Dalgarno sequence and then functionally annotated through 
sequence homology searches against a database of characterized 
proteins. More recently, deep learning models have been used 
to identify [41, 42] and functionally annotate protein sequences 
[43] in genomes from large high-quality experimental datasets. 
One tool currently being leveraged to pinpoint protein sequences 
is DeepRibo, a DNN-based tool that harnesses high-throughput 
ribosome profiling coverage signals and candidate open reading 
frame sequences to map and identify translated open reading frames 
in prokaryotes. A similar tool, REPARATION, performs the same 
function using a random forest classifier [44]. 
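
Tools of this kind score candidate open reading frames, so the candidates must first be enumerated. A minimal sketch of that step follows, assuming a forward-strand scan with ATG as the only start codon. This is a simplification and is not the candidate-generation procedure of DeepRibo or REPARATION.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) of ATG...stop ORFs on the forward strand.

    `end` is exclusive and includes the stop codon. `min_codons` counts
    the codons before the stop, including the ATG.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

print(find_orfs("CCATGAAATTTTAGGG"))
```

Each returned interval could then be paired with ribosome profiling coverage and passed to a classifier that decides whether the frame is actually translated.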

Once new proteins have been discovered, the functional anno- 
tation of their sequences can be performed through DNN-based 
tools like DeepEC, which uses a protein sequence to precisely and 
quickly predict enzyme commission (EC) numbers [43]. EC num- 
bers classify enzymes based on the chemical reactions they catalyze 
and help in the accurate understanding of enzyme functions. 
Beyond DeepEC, alternative EC number prediction tools like 
CatFam [45], DEEPre [46], DETECT v2 [47], ECPred [48], 
EFICAz2.5 [49], and PRIAM [50] can also be considered. In addition 
to determining enzyme function, ML could uncover and forecast 
enzymes that can catalyze novel reactions through enzyme 
promiscuity. For instance, chemoinformatic techniques, partitioned 
quantum mechanics, and molecular mechanics can be used to predict 
metabolite-protein interactions in silico [51]. However, these 
techniques are computationally intensive and require domain 
expertise. Consequently, searching and matching promiscuous enzymes 
to a reaction are increasingly being performed with more 
computationally efficient techniques, such as the support vector 
machine [52] and the Gaussian process model [53]. These techniques 
make their predictions based on protein sequences (e.g., k-mers), 
reaction signatures (e.g., functional groups, chemical 
transformation properties), and substrate affinity for proteins 
(Km values). Equipped 
with these tools, metabolic engineers now have new ways to find 
enzymes for novel biochemical reactions when no known enzyme is 
available. 

Another application of ML is the design and engineering of 
proteins. The most common approach is the use of directed evolu- 
tion, where proteins go through iterative experimental rounds of 
mutation and selection until the desired function and performance 
are achieved [54]. ML can guide the directed evolution process by 
reducing the number of experimental iterations to attain the 
desired protein. This involves using previous experimental data 
consisting of each protein’s sequence and its functional perfor- 
mance to generate a library of variants with higher fitness. Wu 
et al. simultaneously deployed multiple ML models and picked 
the models with the highest accuracy to more efficiently evolve 
two proteins: the immunoglobulin-binding B1 domain of protein G (GB1) 
and nitric oxide dioxygenase (NOD) from Rhodothermus marinus 
[55]. ML-assisted directed evolution has also been used to increase 
enzyme productivity [56], change fluorescent protein colors [57], 
and optimize protein thermostability [58]. 

Besides directed evolution, ML can also play a part in rational 
protein design. For instance, UniRep can learn the statistical repre- 
sentations of proteins (e.g., physicochemical properties, structural, 
evolutionary, and functional information) from 24 million Uni- 
Ref50 sequences [59] using neural networks. The tool managed 
to predict the stability of a large series of de novo proteins and 
functional changes from point mutations made in wild-type 
proteins. In an exploratory study, George Church’s group at 
Harvard University deployed UniRep to optimize the design of 
green fluorescent protein from Aequorea victoria and the TEM-1 
β-lactamase enzyme from E. coli, even from a limited pool of 
training data [60]. Another study used neural networks trained to 
associate amino acids with the neighboring spatial orientation of 
carbon, oxygen, nitrogen, and sulfur atoms within a protein. By 
doing so, the researchers were able to identify novel gain-of-func- 
tion mutations and improve the protein function of three different 
proteins [61, 62]. 


The ability to predict three-dimensional (3D) protein structures 
from amino acid sequences is particularly useful when designing 
new proteins. Recently, this area of research experienced a 
breakthrough with the debut of DeepMind’s AlphaFold at the 
Critical Assessment of Protein Structure Prediction (CASP) 
competition, where it bested other well-established groups. 
AlphaFold now has a newer version, AlphaFold2, which also uses 
both neural networks and gradient descent [63]. Other algorithms 
that perform the same function include RoseTTAFold, by David 
Baker’s group at the University of Washington [64], and one by 
Mohammed AlQuraishi, an independent fellow in Systems 
Pharmacology at Harvard Medical School [65]. 

Predicting the secretability of proteins is particularly interesting 
to synthetic biologists as there are myriad applications that come 
from being able to secrete recombinant proteins from cells, 
including engineering therapeutic cells to deliver protein drugs 
[66, 67] and large-scale industrial protein production [68]. The 
SECRiFY tool uses a gradient-boosted decision tree model and CNNs 
to predict the secretability of protein fragments by two yeast 
species, namely, S. cerevisiae and Pichia pastoris, from a custom 
fragment library, surface display, and a deep sequencing readout 
[69]. Another study described the development of the SignalP 
5.0 tool, which utilizes DNNs to make protein signal peptide 
predictions from amino acid sequences [41]. This improved tool 
can detect signal peptides across all domains of life and can 
distinguish them across Gram-positive bacteria, Gram-negative 
bacteria, and archaea. 

These ML-based tools will allow synthetic biologists to efficiently 
design and optimize biological components for cell engineering. 
Leveraging these tools will enhance the evolution of novel 
and more complex cell designs in areas such as therapeutic cell 
development, precision diagnostics, and industrial biotechnology. 
While ML tools can never replace wet lab experiments, as 
experimental validation is still needed, the resulting predictions 
and insights can help accelerate conventional screening and 
iterative experimental procedures as well as guide the design and 
engineering of biological components, reducing the time and 
resources needed to achieve the desired results. 


2.2 Computational Tools for Metabolic Engineering 


Broadly speaking, metabolic engineering builds upon the foundation 
of cell engineering. Instead of designing and controlling the 
expression of a single gene or the synthesis of a single protein, the 
subfield involves redesigning pathways that alter the metabolism of 
the engineered organism. Specifically, metabolic engineering 
involves modifying the natural chemical reactions of cells to focus 
on manufacturing desired biological compounds. This is often a 
multi-step process involving multiple enzymes. 

While there are numerous enzyme pathways and unique products a 
cell can produce, they typically require the use of a small 
group of common metabolites or cofactors [70]. Accordingly, it is 
necessary to consider the broader cellular context when trying to 
optimize yields of a desired metabolite [71]. For instance, a single 
compound could be a product of many biochemical pathways 
[72]. Although high-yield pathways have been engineered 
through rational design [73-75], such efforts work best for simple 
pathways while also requiring detailed knowledge of the enzyme 
reactions involved and significant experimental work. These 
considerations have led to a growing interest in using computational 
techniques to approach metabolic engineering as an optimization 
problem. 

To engineer organisms for high-yield production, one must 
first identify a suitable series of chemical steps for converting sub- 
strates to the desired product. While early works focused on 
enhancing preexisting metabolite yields, advancements in genetic 
engineering technology have since enabled the de novo assembly of 
entire metabolic pathways into production hosts [76, 77]. 

One common method for generating pathways is to combine 
known enzymes into novel pathways using databases such as KEGG 
[78] and BRENDA [79]. These methods can be aided by tools that 
identify plausible pathways for selected substrate-product pairs 
[80]. While such methods have strong predictive power, their 
scope is limited by the selective use of manually curated enzyme 
reaction data. A broader alternative is to create pathways using 
generalized reaction rules, which consider the full biochemical 
reaction space [72]. As in cell engineering, these rules permit 
pathway prediction using inferred enzyme promiscuity, enabling 
designs that involve novel metabolites. Notably, Hadadi et al. 
created the ATLAS database as a repository of all theoretical enzyme 
reactions connecting KEGG metabolites [81]. 
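
Finding a plausible pathway for a substrate-product pair can be viewed as a graph search over known reactions. The sketch below uses breadth-first search on a small made-up reaction table. The metabolite and enzyme names are placeholders, and the code does not reproduce how any cited tool or database performs its search.

```python
from collections import deque

# Hypothetical reaction table: enzyme -> (substrate, product).
REACTIONS = {
    "E1": ("S", "M1"),
    "E2": ("M1", "M2"),
    "E3": ("M2", "P"),
    "E4": ("M1", "X"),
    "E5": ("S", "M2"),
}

def shortest_pathway(substrate, product, reactions=REACTIONS):
    """Breadth-first search for the fewest-step enzyme route, or None."""
    queue = deque([(substrate, [])])
    seen = {substrate}
    while queue:
        metabolite, route = queue.popleft()
        if metabolite == product:
            return route
        for enzyme, (src, dst) in reactions.items():
            if src == metabolite and dst not in seen:
                seen.add(dst)
                queue.append((dst, route + [enzyme]))
    return None

print(shortest_pathway("S", "P"))
```

Replacing the curated table with reactions generated from generalized rules enlarges the graph considerably, which is where the computational cost of rule-based approaches comes from.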

However, the main drawback of this approach is its computa- 
tional intensity, though this can be reduced by using defined sets of 
rules to limit searches. With the gradual shift in methods towards 
de novo pathway design, the past years have seen the corresponding 
growth of computational tools and frameworks for this particular 
use case [82-84]. Moreover, it is also possible to suggest pathway 
steps using predicted enzyme activity retrieved from genome 
sequence mining [85]. The most widely used of these tools is 
antiSMASH, a data-based workflow for identifying biosynthetic 
gene clusters and predicting the chemical structures of their 
metabolites [86]. Tools based on hidden Markov models like PRISM 
have also been used to predict novel antibiotic compounds from 
genome databases [87]. 

After an enzyme pathway is designed, the next step involves 
optimizing the host strain for metabolic engineering metrics like 
titer, rate of production, and yield (TRY). This is usually achieved 
by tuning the gene expression of both the synthetic and endoge- 
nous pathways. One approach is to create mathematical models that 
represent cellular processes to help identify potential production 
limitations. The most widely used of these techniques is known as 
flux balance analysis, which models the flow of chemical com- 
pounds through metabolic networks using stoichiometric rules 
[88]. The ubiquity of flux balance analysis can be attributed to its 
generality and computational affordability [89], with an increasing 
range of software applications being developed for this purpose 
[90]. Meanwhile, tools such as COBRA use computational frame- 
works to predict gene regulation or knockouts that increase the 
production of a target metabolite [91]. However, the reductionist 
nature of such analyses can limit their predictive power and possible 
use cases. 
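
In its standard textbook form, flux balance analysis can be written as a linear program. The notation below is conventional rather than taken from the chapter: S is the stoichiometric matrix (metabolites by reactions), v the vector of reaction fluxes, c a weight vector selecting the objective (for example, biomass or product formation), and v_min and v_max the flux bounds.

```latex
\begin{aligned}
\max_{v}\quad & c^{\top} v \\
\text{subject to}\quad & S\,v = 0, \\
& v_{\min} \le v \le v_{\max}
\end{aligned}
```

The constraint S v = 0 encodes the steady-state assumption that every internal metabolite is produced and consumed at equal rates. Knockouts can be represented by setting both bounds of the corresponding reaction to zero.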

A more quantitative alternative is to use mechanistic models to 
capture metabolic networks. For instance, chemical species can be 
described by using enzyme kinetics as a series of rate laws, which, in 
turn, can be analyzed as a series of ordinary differential equations. 
In doing so, these mechanistic models facilitate sensitivity analyses 
that help determine potential metabolic engineering targets for 
increasing TRY. As opposed to flux balance analysis (FBA), mecha- 
nistic modeling provides insights into isozyme activity and how 
metabolite pools may be affected by gene manipulation [92 ]. How- 
ever, the technique’s main drawback is its reliance on having char- 
acterized kinetic data for accurate predictions. While validated 
models are common for species like E. coli [93-95], characterizing 
large numbers of enzymes is a laborious and time-consuming effort 
that can be unfeasible for non-model organisms. 
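
A mechanistic model of this kind can be sketched for a two-step pathway S -> I -> P with Michaelis-Menten rate laws, integrated here with a simple forward-Euler scheme. All parameter values and units are arbitrary assumptions chosen for illustration. A real model would use characterized kinetic data and a proper ODE solver.

```python
def mm_rate(vmax, km, conc):
    """Michaelis-Menten rate law: v = Vmax * [S] / (Km + [S])."""
    return vmax * conc / (km + conc)

def simulate(s0=10.0, steps=500, dt=0.01,
             vmax1=1.0, km1=0.5, vmax2=0.6, km2=0.3):
    """Forward-Euler integration of S -> I -> P (arbitrary units)."""
    s, i, p = s0, 0.0, 0.0
    for _ in range(steps):
        v1 = mm_rate(vmax1, km1, s)   # enzyme 1 converts S to I
        v2 = mm_rate(vmax2, km2, i)   # enzyme 2 converts I to P
        s, i, p = s - v1 * dt, i + (v1 - v2) * dt, p + v2 * dt
    return s, i, p

base = simulate()
boosted = simulate(vmax2=1.2)  # e.g., doubling the capacity of enzyme 2
print(base[2], boosted[2])
```

Comparing the product level between the two runs is a crude one-parameter sensitivity analysis: if raising the capacity of an enzyme increases product formation, that enzyme is a candidate engineering target.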

In contrast, the other major approach for strain optimization is 
through data-driven ML analysis. Within the context of metabolic 
engineering, supervised ML algorithms can factor in multi-omics 
data and culture conditions to predict metabolite production. Since 
they do not require prior knowledge of the underlying biochemis- 
try, ML approaches are widely applicable for combinatorial pathway 
assembly [96] and improving TRY (Czajka et al. 2021). At present, 
one major challenge for ML in metabolic engineering lies in 
generating biological datasets large enough for training 
algorithms. To address this limitation, Radivojevic et al. developed 
ART, an ML tool combining network optimization and experimental 
design [97]. By recommending experimental designs to achieve 
the specified objective, the team demonstrated predictive modeling 
with as few as 19 built strains in a test cycle. 

Although mechanistic models and ML have fundamentally different 
bases, there is a developing interest in integrating the two. In 
principle, this leverages the advantages of both approaches to 
provide data-driven predictions and insights into the underlying 
biology. For example, imposing model constraints based on 
biological context has been shown to increase prediction accuracy 
by disregarding biologically impossible solution spaces [98]. One 
direction being explored is the use of data obtained from mecha- 
nistic models as an input for ML. In silico flux predictions, for 
example, have been shown to increase the predictive power of ML 
in whole-genome models of yeast and cyanobacteria 
[99, 100]. Likewise, genome-scale models can be used to identify 
engineering targets and focus the scope of ML algorithms 
[101]. An alternate approach is to use ML to predict the parameters 
used in mechanistic models. As shown by Heckmann et al., enzyme 
turnover rates determined by ML models outperform naively 
assigned values at flux predictions [102]. 

Beyond engineering changes in the host organism, computa- 
tional methods can also improve TRY by optimizing bioreactor 
conditions. After all, microbial growth is influenced by environ- 
mental conditions including substrate availability, oxygen content, 
and pH—with any changes to these reactor parameters thereby 
affecting metabolite production. However, experimentally opti- 
mizing bioreactor cultures can be lengthy and costly. To this end, 
computational strategies to optimize bioreactor performance have 
been developed, including more contemporary neural network 
models that can suggest conditions for improving metabolite pro- 
duction [103]. Moreover, biochemical production is often carried 
out in high-volume bioreactors, which can have drastically different 
environments from small-scale lab cultures [104]. Design tools 
have since been adapted to accommodate some common limita- 
tions. For example, metabolic burden can be integrated in genome- 
scale models to account for the energy cost of expressing synthetic 
pathways [105]. Meanwhile, heterogeneous population modeling 
can be used to account for cell variability and stochasticity in 
bioreactors [106]. 

Ultimately, one of the goals of metabolic engineering is to 
integrate pathway design as well as the optimization of host strain 
and culture conditions into a combined pipeline (Fig. 3). Having a 
standard workflow increases reproducibility, helps reduce the time 
needed from project conceptualization to realization [107], and 
enables the use of experimental automation to increase throughput. 
Still, despite the advantages of having a comprehensive pipeline for 
metabolic engineering, there is a lack of published research describ- 
ing such approaches. This creates an opportunity for industries to 
develop proprietary workflows for engineering organisms for 
industrial applications and the academic sector to explore ways to 
incorporate ML algorithms in streamlining the engineering of 
biological pathways in organisms. 


Fig. 3 Applications of machine learning in metabolic engineering workflows. A metabolic engineering project 
can be broadly split into three phases: designing metabolic pathways, optimizing cells for production, and 
optimizing industrial processes for product yield. Many computational tools have been developed to guide 
design throughout the process. (a) Pathways for synthesizing target products can be designed using known 
biochemical reactions or predicted gene functions. This can help identify hosts with natural industrially 
relevant properties. (b) Strains are engineered to maximize production titer, rate, and yield. Mechanistic 
approaches use knowledge of the underlying biology to predict metabolite production. In contrast, data-driven 
methods identify patterns from large datasets to suggest improvements. Recent efforts have attempted to 
combine the two approaches to increase predictive power. (c) Downstream bioprocesses are optimized for 
high product output. In silico prediction can greatly reduce the time required to adapt a laboratory strain for 
industrial-scale production. (Created with Biorender.com) 


3 Conclusions and Future Outlooks 


Through the examples described in the chapter, it is apparent that 
ML is already shaping synthetic biology across several subfields. 
With the advancements in the development of computational 
tools for cell, protein, and metabolic engineering, ML is 
increasingly being integrated into workflows as a tool for providing 
richer, data-driven insights that can then inform experimental 
approaches. Crucially, the use of ML techniques allows the genera- 
tion of predictive models even with limited knowledge and data on 
underlying mechanisms. By first modeling hypotheses in silico that 
can then be selectively tested through experiments, ML increases 
the speed and precision of biological research projects all while 
reducing the time and resources needed. As a result, larger libraries 
or combinatorial approaches that may otherwise be too trouble- 
some to study can be screened. 

Despite these initial advantages, there are several challenges 
that must be addressed to unleash the full potential of 
ML. Arguably, the largest issue lies in the difficulty of generating 
the large datasets required for training models. While ML 
approaches can facilitate biological design, much of the building 
and testing is done by hand and is hard to scale up. Additionally, the 
lack of a fixed standard in data collection and reporting across labs 
and institutions contributes to the difficulty in directly comparing 
data from multiple studies. While these challenges can be addressed 
by experimental automation (e.g., liquid handling systems) to a 
certain extent, such technologies remain costly. At present, training 
ML models may still require high computational costs, but the 
continuous improvement of computing power through the years 
may make the costs more manageable. 

Efforts among the community to make published models open- 
source are already making ML approaches more accessible to 
researchers. With the rise of computational and ML tools, we predict 
that they will play an integral role in future synthetic biology work- 
flows. Through these tools, we envisage the possibility of precise yet 
broadly applicable models that can facilitate de novo bioengineering 
with unparalleled efficiency. ML coupled with automation will allow 
the faster completion of research, from design to execution, with 
minimal human intervention. This would enable scientists and engi- 
neers to take on a more strategic role—shifting their time and focus 
to higher-value activities like planning and innovation. Such innova- 
tions are increasingly necessary in the coming years, with the global 
need for sustainable practices in manufacturing, medicine, engineer- 
ing, and agriculture. Computational tools have revolutionized syn- 
thetic biology over the past two decades, and we look forward to 
seeing where they can lead us in the future. 


Design and Analysis of Massively Parallel Reporter Assays 
Using FORECAST 


Pierre-Aurelien Gilliot and Thomas E. Gorochowski 


Abstract 


Machine learning is revolutionizing molecular biology and bioengineering by providing powerful insights 
and predictions. Massively parallel reporter assays (MPRAs) have emerged as a particularly valuable class of 
high-throughput technique to support such algorithms. MPRAs enable the simultaneous characterization 
of thousands or even millions of genetic constructs and provide the large amounts of data needed to train 
models. However, while the scale of this approach is impressive, the design of effective MPRA experiments 
is challenging due to the many factors that can be varied and the difficulty in predicting how these will 
impact the quality and quantity of data obtained. Here, we present a computational tool called FORE- 
CAST, which can simulate MPRA experiments based on fluorescence-activated cell sorting and subsequent 
sequencing (commonly referred to as Flow-seq or Sort-seq experiments), as well as carry out rigorous 
statistical estimation of construct performance from this type of experimental data. FORECAST can be used 
to develop workflows to aid the design of MPRA experiments and reanalyze existing MPRA data sets. 


Key words Massively parallel reporter assay, Cell sorting, Sequencing, Inference, Experimental 
design, Bioinformatics, Synthetic biology 


1 Introduction 


In order to effectively engineer biology, it is necessary to under- 
stand the complex relationship between DNA sequence and 
function [1-7]. However, this relationship is notoriously hard to 
map out due to the high-dimensional structure of the underlying 
functional landscape. Data-driven approaches offer a means to 
address this challenge but require large amounts of training data 
that can be costly to generate experimentally [8]. To tackle this 
issue, new massively parallel reporter assays (MPRAs) have become 
popular, allowing for large-scale genotype-to-phenotype maps to 
be produced [9-18]. The most prominent MPRA methods are 
Flow-/Sort-seq-based techniques [19] where vast libraries of 
genetic constructs are designed to control the expression of a 
fluorescent reporter (normally a fluorescent protein). Large pools
of these genetic constructs are assembled and inserted into living
cells. Fluorescence-activated cell sorting and sequencing are then
used to estimate the phenotype (i.e., fluorescence level) of each
genetic construct (Fig. 1). This approach has enabled the parallel
measurement of up to 100 million genetic constructs at a
time [20].


Fig. 1 Overview of Flow-/Sort-seq-based massively parallel reporter assays (MPRAs). Diverse genetic
constructs (genotypes) can be generated in several different ways. Common examples include random pooled
combinatorial assembly and massively parallel oligo synthesis. Each genotype must have a fluorescence-based
phenotype when placed in a living cell. Flow-/Sort-seq then takes a mixed pool of cells transformed
with this genotype library and uses fluorescence-activated cell sorting (FACS) to separate cells with similar
fluorescence values into a discrete set of bins. Constructs in each bin are barcoded and then sequenced to
allow for the genotypes present in each bin to be recovered. This then provides the genotype-to-phenotype
data necessary for training machine learning models

Genomic regulation has been increasingly studied in this way with 
MPRA data being combined with machine learning algorithms 
(e.g., recurrent or convolutional neural networks [21-24]) to derive 
predictive models that can be subsequently used for forward design 
[20, 25-29]. To further improve model performance, efforts have 
mostly been geared towards collecting larger amounts of data 
[12, 20] or developing more flexible machine learning algorithms 
that learn more quickly [30, 31]. However, an aspect that has been 
overlooked is the design of the MPRA experiments themselves and the 
quality of the data produced. Every MPRA experiment has many 
parameters in its design, and all of these can affect the accuracy of the 
inferred results. Therefore, there is a need for systematic approaches to 
explore MPRA experimental design space and for more rigorous data 
analysis methods to ensure suboptimal decisions are avoided. 

In this chapter, we present a computational tool called
FORECAST that aims to address this problem. FORECAST is a Python 
package that can accurately simulate Flow-/Sort-seq-based MPRA 
experiments and provide maximum likelihood (ML) and method of 
moments (MoM)-based estimators for the accurate inference of 
construct performance from MPRA data. Here, we demonstrate 
how these features can be used to explore the design parameters of 
an MPRA experiment and ensure the robust analysis of MPRA- 
based data. 
2 Materials


2.1 Software Dependencies

FORECAST requires that the following software tools and
packages are installed and accessible from the command line.


1. Python version 3.7 or later (we recommend a distribution like 
Anaconda). 


2. Git version 2.20 or later. 


3. Conda version 4.0 or later. 
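
Whether these tools are reachable from the command line can be checked before installation. The following is a minimal sketch that uses only the Python standard library. The helper name "find_tools" is illustrative and not part of FORECAST, the executable names ("python3", "git", "conda") are assumptions that may differ between systems, and version numbers are not verified.

```python
import shutil


def find_tools(names=("python3", "git", "conda")):
    """Return a dict mapping each executable name to its path, or None if absent."""
    # shutil.which searches the directories listed in PATH, as a shell would.
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path if path else 'NOT FOUND on PATH'}")
```

Any tool reported as not found needs to be installed, or added to PATH, before continuing with the installation.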


2.2 Installation

1. A copy of the latest version of FORECAST can be downloaded
by running the command: 


git clone https://gitlab.com/Pierre-Aurelien/forecast.git 


2. This will create a directory called “forecast” with the structure 
shown in Fig. 2. It is crucial that all commands are executed 
from within the root of this directory. To move to this direc- 
tory, use the command: 


cd forecast 


forecast (root install location)
    LICENSE
    README.md
    forecast
        __init__.py
        run_simulation.py
        run_inference.py
        data
        figure
        investigation
        protocol
        util
    test
    out


Fig. 2 Structure of the FORECAST code repository. Directories are shown in bold. 
Each directory in the “forecast” directory has a key function: “data” holds data 
samples used for simulation; “figure” contains scripts to produce output plots; 
“investigation” holds scripts capturing analyses that are possible; “protocol” 
contains the steps to model the MPRA experiment during simulation; and “util” 
holds general utility functions. The “test” directory is used for testing functions, 
and the dashed “out” directory is created upon execution of a simulation or 
analysis script and used to hold directories containing the outputs from each of 
these 
3. We recommend creating a conda environment called “forecast”
to house the additional Python packages that are required. This
can be done by running the command:


conda env create -f environment.yml


4. This new environment must then be activated using:


conda activate forecast


5. Finally, all Python dependencies can be installed by running:


pip install -e .


6. FORECAST is now ready to be used (see Note 1).


3 Methods


FORECAST is a Python package that provides several command-line
tools to aid the simulation and analysis of Flow-/Sort-seq-based
MPRA data (herein referred to as MPRA data). In the
following sections, we outline each of the available commands
and the scope of their use when simulating MPRA experiments or
analyzing MPRA data.


3.1 Simulation of MPRA Experiments


FORECAST can generate biologically realistic MPRA data by simulating
each of the steps in an MPRA experiment. This includes cell sorting,
PCR amplification during sequencing library preparation, and finally
sequencing. To do so, it requires input files providing 
key parameters for the fluorescence output distributions of each 
construct in the library to be simulated. 
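
The logic of such a simulation can be illustrated with a deliberately simplified sketch. This is not FORECAST's implementation: it assumes gamma-distributed fluorescence, an equal chance for every construct to be carried by a cell, log-spaced bins starting at 1 a.u., and reads drawn in proportion to the sorted cells, and it ignores PCR amplification. The function name and all arguments are hypothetical.

```python
import bisect
import math
import random


def simulate_sort_seq(library, n_cells=2000, n_bins=8, f_max=1e5,
                      n_reads=5000, seed=0):
    """Toy Flow-/Sort-seq simulation.

    library maps a construct ID to the (shape a, scale b) parameters of a
    gamma distribution describing that construct's fluorescence.
    Returns (cells sorted per bin, reads per construct per bin).
    """
    rng = random.Random(seed)
    # Upper bounds of log-spaced bins between 1 a.u. and f_max (assumption).
    step = math.log10(f_max) / n_bins
    upper = [10 ** (step * (k + 1)) for k in range(n_bins)]
    ids = list(library)

    # Cell sorting: draw a fluorescence value per cell and place it in a bin.
    cells = {cid: [0] * n_bins for cid in ids}
    for _ in range(n_cells):
        cid = rng.choice(ids)
        a, b = library[cid]
        fluorescence = rng.gammavariate(a, b)
        # Values above f_max are collected in the last bin (assumption).
        k = min(bisect.bisect_left(upper, fluorescence), n_bins - 1)
        cells[cid][k] += 1

    # Sequencing: each read comes from a (construct, bin) pair with
    # probability proportional to the number of cells sorted there.
    slots = [(cid, k) for cid in ids for k in range(n_bins)]
    weights = [cells[cid][k] for cid, k in slots]
    reads = {cid: [0] * n_bins for cid in ids}
    for cid, k in rng.choices(slots, weights=weights, k=n_reads):
        reads[cid][k] += 1

    cells_per_bin = [sum(cells[cid][k] for cid in ids) for k in range(n_bins)]
    return cells_per_bin, reads


cells_per_bin, reads = simulate_sort_seq({"c1": (2.0, 50.0), "c2": (5.0, 500.0)})
print(cells_per_bin)
print(reads)
```

The two returned objects mirror, in spirit, the per-bin cell counts and the per-construct read counts that an MPRA experiment provides.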


1. To provide the parameters for the output fluorescence distribution
of each construct in the library, a tab-delimited text file
(with a “.csv” extension) is required. We allow for either
gamma or lognormal distributions to be used, with the shape
a and scale b or mean μ and standard deviation σ parameters
provided, respectively. Each row in the CSV file corresponds to
a separate construct (genotype) with the construct ID, a or μ,
and finally b or σ, depending upon the distribution type
(gamma or lognormal, respectively). If using a gamma distribution,
the file must be named “library_gamma.csv,” and if
using a lognormal distribution, the file must be named
“library_lognormal.csv.” We provide examples of both types of these
files derived from real MPRA experiments [12, 32] in the
“data/gamma” and “data/lognormal” directories. These are
used by default by all FORECAST commands unless a
user-specified library is provided.


2. Once the distributions for constructs in a library are available as
an appropriately formatted CSV file, simulation of an MPRA
experiment can be performed. This is carried out using the
command:


generate auto_bin


This will create a directory called “simulation_” followed
by the date and time in the “out” directory. Three CSV files are
created within this directory. The first, “cells_bins.csv”, is a
tab-delimited CSV file containing a single row, where each
entry represents the number of cells sorted per bin. The second,
“sequencing.csv”, is also a tab-delimited table where each row
corresponds to a different genetic construct, and each column
indicates the read counts per bin. The final file,
“metadata_simulation.csv”, is a tab-delimited table detailing the parameters used
to conduct the simulation. Examples of these files can be found
in the “data/flow_seq” directory.


3. By default, the “generate” command will use the data and
options specified in Table 1. These can be individually altered
to enable user-defined simulations. For example, the following
command:


generate auto_bin --bins 8 --reads 1e8


will simulate an MPRA experiment using the default
gamma-distributed construct library with 8 log-spaced bins and
100 million sequencing reads.


3.2 Inferring Construct Performance from MPRA Data


Once MPRA data has been generated computationally (as in
Subheading 3.1) or measured from experiment, it is necessary to infer
the performance (i.e., average fluorescence) of each genetic construct.
FORECAST can generate estimates using both maximum
likelihood (ML) and method of moments (MoM) approaches
assuming either a gamma (the default) or lognormal distribution
for the underlying data.
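
To make the idea behind the MoM approach concrete, a minimal sketch for a gamma distribution is given below. It is an illustration under stated assumptions, not FORECAST's estimator: the reads of a construct are first rescaled to an estimated number of cells per bin, each bin is then represented by the geometric mean of its boundaries, and the shape a and scale b follow from the sample mean and variance (a = mean²/variance, b = variance/mean). The function name and arguments are hypothetical, and this sketch simply refuses constructs found in a single bin.

```python
import math


def gamma_mom_from_bins(reads, cells_per_bin, reads_per_bin, edges):
    """Method-of-moments gamma fit for one construct from binned read counts.

    reads          -- reads of this construct in each bin
    cells_per_bin  -- total number of cells sorted into each bin
    reads_per_bin  -- total reads (all constructs) obtained from each bin
    edges          -- bin boundaries, len(reads) + 1 positive values
    """
    # Rescale reads to an estimated number of cells of this construct per bin.
    est_cells = [r * c / t if t > 0 else 0.0
                 for r, c, t in zip(reads, cells_per_bin, reads_per_bin)]
    total = sum(est_cells)
    if total == 0:
        raise ValueError("construct was not sequenced in any bin")
    # Represent each bin by the geometric mean of its boundaries
    # (a modeling assumption suited to log-spaced bins).
    centers = [math.sqrt(lo * hi) for lo, hi in zip(edges[:-1], edges[1:])]
    mean = sum(n * x for n, x in zip(est_cells, centers)) / total
    var = sum(n * (x - mean) ** 2 for n, x in zip(est_cells, centers)) / total
    if var == 0:
        raise ValueError("all cells fall in one bin; variance cannot be estimated")
    shape = mean ** 2 / var  # a
    scale = var / mean       # b
    return shape, scale


edges = [1, 10, 100, 1000, 10000, 100000]
a, b = gamma_mom_from_bins([0, 10, 30, 10, 0], [100] * 5, [50] * 5, edges)
print(a, b)
```

By construction, the product a × b of the returned values equals the weighted mean fluorescence of the binned data.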


1. To analyze MPRA data, it must be in a compatible format. 
Specifically, FORECAST requires a directory containing two 
CSV files: (1) “cell_bins.csv” that contains a tab-delimited list 
(m elements long) of the number of cells sorted into each of the 
m bins and (2) “sequencing.csv” that contains rows 
corresponding to each construct (unique genotype) and col- 
umns where the first denotes the “ID” of the construct and the 
following m columns contain the number of reads recovered for 
that construct in each bin after cell sorting. The format of these 
files is identical to those generated by the simulator described 
in Subheading 3.1, allowing simulated data to be immediately 
analyzed. 
Table 1
Command-line options for the generate command

Flag                   Type     Default                  Description

--distribution         String   gamma                    Fluorescence distribution: “gamma” or “lognormal”
--size                 Float    1e6                      Number of cells sorted
--reads                Float    1e5                      Total number of sequencing reads
--ratio_amplification  Float    1e2                      PCR amplification ratio
--bias_library         Boolean  FALSE                    Allow for a different number of cells for each construct
--metadata_path        Path     forecast/data/gamma      Path to construct distributions
--out_path             Path     out/simulation_DATETIME  Path for output files
--f_amp                Integer  1                        Fluorescence per protein

auto_bin*
--f_max                Float    1e5                      Maximum measurable fluorescence value
--bins                 Integer  12                       Number of logarithmically spaced bins used for sorting

custom_bin*
--upper_bounds         List     1e2 1e3 1e4              Upper fluorescence bounds of each bin. First upper fluorescence bound must be greater than 1 a.u.

*The auto_bin and custom_bin options are mutually exclusive and should be specified directly after the generate
keyword. Flags specific to their operation are provided below each option in the table


2. Inference can then be performed by running:


infer auto_bin --metadata_path DATA_PATH


Here, “DATA_PATH” is the path to the input files
described in the previous step, and if no “--metadata_path”
option is provided, the latest simulation output from the “out”
directory will be used. This command will create a directory
called “inference_” followed by the date and time in the “out”
directory. This new directory will contain four CSV files: (1) a
copy of the input “cells_bins.csv” file; (2) a copy of the
“sequencing.csv” file; (3) a tab-delimited file named
“results.csv” holding the results from the inference step (each row in
“results.csv” corresponds to a single genetic construct;
descriptions of each column are provided in Table 2); and (4) a
tab-delimited table named “metadata_inference.csv” filled with
all parameters used to conduct the inference. It should be
noted that this inference step is computationally expensive
due to the likelihood calculations (see Note 2).


Table 2
Column descriptions for result files generated by the infer command

Column name      Description

a_mle            Maximum likelihood (ML) estimate of the shape parameter
b_mle            ML estimate of the scale parameter
a_se             ML estimate of the standard error for the shape parameter
b_se             ML estimate of the standard error for the scale parameter
a_mom            Method of moments (MoM) estimate of the shape parameter
b_mom            MoM estimate of the scale parameter
mean             Fluorescence mean. Log fluorescence mean if using the lognormal distribution
st_dev           Fluorescence standard deviation. Log fluorescence standard deviation if using
                 the lognormal distribution
inference_grade  Quality of ML inference (lower is better): (1) ML is successful; (2) ML possible,
                 but standard errors can’t be derived as the observed Fisher information matrix is not
                 invertible; (3) the construct has only been sequenced in one bin, so ML is not
                 useful and only MoM inference is conducted; (4) no inference possible,
                 construct has not been sequenced at all
score            Percentage of reads at the first or last bin (lower is better)
mu_mle           ML estimate of the lognormal μ parameter
sigma_mle        ML estimate of the lognormal σ parameter
mu_se            ML estimate of the standard error for the lognormal μ parameter
sigma_se         ML estimate of the standard error for the lognormal σ parameter


3. The “infer” command has many options to tailor how the
inference step is performed. These options and their default values
are described in Table 3. For example, the following command:


infer custom_bin --upper_bounds 1e2 1e3 1e4 2e4 --last_index all --metadata_path DATA_PATH


will conduct inference on all constructs in the data located in DATA_PATH and
will assume that the fluorescence upper bounds of the four bins are
100, 1000, 10,000, and 20,000, respectively.
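
How a list of upper bounds defines the bins can be sketched as follows. The helpers are hypothetical and only illustrate the two binning modes: logarithmically spaced bounds computed from a bin count and a maximum fluorescence (as for auto_bin), or an explicit list of bounds (as for custom_bin). The lower limit of 1 a.u. and the treatment of values above the last bound (collected in the last bin) are assumptions, not documented behavior of the tool.

```python
import bisect
import math


def auto_upper_bounds(n_bins=12, f_max=1e5):
    """Upper bounds of n_bins log-spaced bins between 1 a.u. and f_max."""
    step = math.log10(f_max) / n_bins
    return [10 ** (step * (k + 1)) for k in range(n_bins)]


def assign_bin(fluorescence, upper_bounds):
    """Index of the first bin whose upper bound is >= fluorescence."""
    k = bisect.bisect_left(upper_bounds, fluorescence)
    # Assumption: values above the last bound are collected in the last bin.
    return min(k, len(upper_bounds) - 1)


custom = [1e2, 1e3, 1e4, 2e4]  # the bounds used in the command above
print([assign_bin(f, custom) for f in (50, 1500, 15000, 5e4)])
print(auto_upper_bounds(5, 1e5))
```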
Table 3
Command-line options for the infer command

Flag             Type     Default                  Description

--distribution   String   gamma                    Fluorescence distribution. Either “gamma” or “lognormal”
--metadata_path  Path     Latest simulation        Path to MPRA data
--out_path       Path     out/simulation_DATETIME  Path for output files
--first_index    Integer  0                        Starting index for inference
--last_index     String   sample                   Final index for inference: “sample” (only 100 constructs) or “all” (all constructs)
--num_workers    Integer  -1 (all CPUs)            Number of parallel processes to use
--verbose        Integer  1                        Display progress messages: 1 = TRUE, 0 = FALSE

auto_bin*
--f_max          Float    1e5                      Maximum measurable fluorescence value

custom_bin*
--upper_bounds   List     1e2 1e3 1e4              Upper fluorescence bounds of each bin. First upper fluorescence bound must be greater than 1 a.u.

*The auto_bin and custom_bin options are mutually exclusive and should be specified directly after the infer
keyword. Flags specific to their operation are provided below each option in the table


3.3 Optimizing the Design of MPRA Experiments

When designing an MPRA experiment, it is useful to be able to
assess how various choices such as sequencing depth (i.e., total
number of sequencing reads), number of bins used during cell
sorting, etc. affect the accuracy of the inferred fluorescence
distributions for each construct. Typically, this would be done through
experimental trial and error. To avoid this, FORECAST provides
the facility to simulate factorial designs where several design
parameters are varied between a set of discrete values in all possible ways
and simulations used to assess their effect. This allows for the
exploration of the experimental design space and can be used to
pick parameters that ensure the most accurate inference of
construct performance from experimental data.


1. A full factorial combination of parameters for both simulation 
and analysis of an MPRA experiment can be performed by 
using the command:


factorial auto_bin 
This will create a directory called “factorial_design_” followed
by the date and time in the “out” directory. For each 
simulation, three CSV files are created as described in Subhead- 
ing 3.2. For each file, the parameters are indicated in brackets at 
the end of the filename with the following order of parameters: 
“fmax” maximum measurable fluorescence level; “distribu- 
tion” statistical distribution underlying the fluorescence 
observed for each construct; “diversity” total number of differ- 
ent genetic constructs assessed; “bias” a Boolean indicating if 
all genetic designs in the library are sampled equally; “rep” the 
replicate number; “seq” total number of sequencing reads; 
“bins” number of bins used for cell sorting; “size” total num- 
ber of cells sorted; “pcr_amp” amplification ratio during the 
PCR step; and “f_amp” factor capturing fluorescence per pro- 
tein. By default, this command will only run the simulation and 
inference steps for the default parameters shown in Table 4. 


2. To specify a user-defined set of experimental parameters to use, 
command-line flags followed by space-delimited values can be 
used. For example, the following command: 


factorial auto_bin --reads 1e3 1e4 --bins 12 16 22 


will perform a full factorial analysis where experiments will be
simulated for all combinations of 1 × 10³ and 1 × 10⁴ total
sequencing reads and 12, 16, and 22 bins for cell sorting. The
full set of available flags is described in Table 4.
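
The combinations covered by a full factorial design can be enumerated with a few lines of Python. This sketch only illustrates the combinatorics of the example above (2 read depths × 3 bin counts = 6 simulated experiments); the helper "factorial_design" is hypothetical and is not how FORECAST is invoked.

```python
from itertools import product


def factorial_design(**levels):
    """Return one dict per combination of the supplied parameter levels."""
    names = list(levels)
    return [dict(zip(names, combo)) for combo in product(*levels.values())]


designs = factorial_design(reads=[1e3, 1e4], bins=[12, 16, 22])
for design in designs:
    print(design)
```

Because the number of combinations is the product of the number of levels of each parameter, adding parameters or levels quickly increases the number of simulations to run.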


3.4 Assessing the Accuracy of the Inferred Distributions

Several tools are provided to help visualize the impact of both
experimental design and inference methods on the accuracy of the
estimates.


1. When performing inference on MPRA data, the output file
(i.e., “results.csv”) includes additional information about the
quality of the inference step that can be useful in filtering out
those constructs where there is a large amount of uncertainty in
the inferred fluorescence distributions. Specifically, each
row contains an “inference_grade” value that flags issues that
arose during the inference step. Values >1 indicate that assumptions
of the mathematical methods were violated or that a lack of data makes the
inferred distributions inaccurate (see Table 2 for a full description
of all inference grades). If required, this value can be used to manually
retain only those constructs with accurate inference
(i.e., “inference_grade” = 1).

2. In addition, when working with simulated MPRA data and 
wanting to compare estimates of parameters for each inference 
method, the following command can be used:


plot_ci 
Table 4
Command-line options for the factorial command

Flag                   Type     Default                  Description

--rep                  Integer  1                        Number of replicates
--f_max                Float    1e5                      Maximum measurable fluorescence value
--distribution         String   gamma                    Fluorescence distribution: “gamma” or “lognormal”
--bins                 Integer  12                       Number of log-spaced bins used for sorting
--size                 Float    1e6                      Number of cells sorted
--reads                Float    1e5                      Total number of sequencing reads
--ratio_amplification  Float    1e2                      PCR amplification ratio
--bias_library         Boolean  FALSE                    Use a different number of cells for each construct
--metadata_path        Path     forecast/data/gamma      Path to construct distributions
--out_path             Path     out/simulation_DATETIME  Path for output files
--f_amp                Integer  1                        Fluorescence per protein
--first_index          Integer  0                        Starting index for inference
--last_index           String   sample                   Final index for inference: “sample” (only 100 constructs) or “all” (all constructs)
--num_workers          Integer  -1                       Number of parallel processes to use (-1 = all physical CPU cores)
--verbose              Integer  1                        Display progress messages: 1 = TRUE, 0 = FALSE


This will create a directory called “figure_CI_” followed by 
the date and time in the “out” directory. This directory will 
contain two files (plots), one for each distribution parameter 
(either a and b for gamma-distributed data or μ and σ for
lognormal-distributed data) showing the estimated parameters 
inferred by the ML and MoM methods, as well as the ground 
truth value. An example of the output is shown in Fig. 3. 
Fig. 3 Example of plot_ci command output. (a) Comparison of estimates for the a parameter for an
underlying gamma distribution. (b) Comparison of estimates for the b parameter for an underlying gamma
distribution. Each point denotes an individual design with error bars for the maximum likelihood estimator
showing the 99% confidence interval. MoM method of moments, ML maximum likelihood


3. By default, the script will plot the results from the latest
simulated data in the “out” directory. This can be changed
by providing flags such as:


plot_ci --library_path LIBRARY_PATH --metadata_path DATA_PATH


Here, “LIBRARY_PATH” is the path to the construct
library that contains all the ground truth values for each construct
(e.g., location of the “library_gamma.csv” or
“library_lognormal.csv” files), and “DATA_PATH” is the path to
simulation data from this construct library.
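
The manual filtering described in step 1, together with confidence intervals of the kind shown in Fig. 3, can be sketched as follows. Several assumptions are made here: that “results.csv” starts with a header row carrying the column names of Table 2, that “inference_grade” is stored as a number, and that a normal-approximation (Wald) interval, estimate ± z × standard error, is an acceptable way to form a 99% interval. How the tool itself computes its intervals is not specified in this chapter. The function name is hypothetical.

```python
import csv
import io
from statistics import NormalDist


def confident_constructs(handle, level=0.99):
    """Yield (row, lower, upper) for rows with inference_grade == 1.

    lower and upper bound a Wald interval for the gamma shape parameter.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 2.576 for 99%
    for row in csv.DictReader(handle, delimiter="\t"):
        if int(float(row["inference_grade"])) != 1:
            continue  # skip constructs flagged during inference
        a, se = float(row["a_mle"]), float(row["a_se"])
        yield row, a - z * se, a + z * se


# A two-row stand-in for a real results file (values are made up).
example = io.StringIO(
    "a_mle\ta_se\tinference_grade\n"
    "2.0\t0.1\t1\n"
    "3.5\t0.0\t2\n"
)
for row, lower, upper in confident_constructs(example):
    print(row["a_mle"], round(lower, 3), round(upper, 3))
```

For a real analysis, the file handle would come from opening the “results.csv” file of an inference output directory instead of the in-memory example.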
3.5 Visually Comparing Individual Inferred Distributions

It can be useful to visually compare the inferred fluorescence
distributions for individual constructs to the underlying sequencing
read histograms and potentially the ground truth if the data is
simulated. This allows for the accuracy of the inference to be
assessed on a per construct basis and further analysis of those
constructs that may not adhere to the expected shape of the
fluorescence distribution.


1. For comparisons between the ML- and MoM-inferred distributions,
as well as the read depth histogram, the following
command can be used:


plot_pdf auto_bin 


This will create a directory called “figure_pdf_” followed 
by the date and time in the “out” directory. An example of the 
plot produced is shown in Fig. 4. By default (without any 
arguments), the script will plot the results for the latest 
simulated MPRA data in the “out” directory for the first con- 
struct in the library. 


[Fig. 4 plot: x-axis “Fluorescence (a.u.)” on a logarithmic scale; y-axis “Estimated number of cells per bin”; legend entries “Read count”, “ML inference”, and “MoM inference”]


Fig. 4 Example of plot_pdf command output. The histogram denotes the number of cells per bin, while the
two lines indicate the fluorescence distributions inferred from it. MoM method of moments, ML maximum
likelihood, pdf probability density function, a.u. arbitrary units
Fig. 5 Example of plot_all command output. The histogram denotes the number of cells per bin, while the
two continuous lines indicate the fluorescence distributions inferred from it and the dashed black line the
underlying ground truth distribution used to generate the data. MoM method of moments, ML maximum
likelihood, pdf probability density function, a.u. arbitrary units


2. For simulated data, the following command allows for the 
ground truth to be additionally plotted on the figure:


plot_all auto_bin 


Again, this will create a directory called “figure_pdf_” fol- 
lowed by the date and time in the “out” directory. An example 
of this type of plot is shown in Fig. 5. By default, the script will 
plot the results for the latest simulated MPRA data in the “out” 
directory for the first construct in the library. 


3. Both these commands can be provided with optional flags 
(detailed in Table 5) that allow for the plot to be customized. 
For example, the following command will create a comparison 
plot for construct number 13 from the last generated 
simulation data: 


plot_pdf auto_bin --construct 13 
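
Outside of the plotting commands, an inferred gamma distribution can also be compared numerically with a cell histogram by computing the fraction of cells that the distribution places in each bin. The sketch below does this with the trapezoidal rule on a logarithmic grid, using only the standard library. It is an illustration and not part of FORECAST; the function names, the example parameters, and the number of integration steps are arbitrary choices.

```python
import math


def gamma_pdf(x, a, b):
    """Density of a gamma distribution with shape a and scale b at x > 0."""
    log_pdf = (a - 1) * math.log(x) - x / b - math.lgamma(a) - a * math.log(b)
    return math.exp(log_pdf)


def expected_fraction_per_bin(a, b, edges, steps=2000):
    """Probability mass of gamma(a, b) between consecutive bin edges."""
    fractions = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        u0, u1 = math.log(lo), math.log(hi)
        h = (u1 - u0) / steps
        # Substituting x = exp(u) turns the integrand into pdf(x) * x.
        ys = []
        for i in range(steps + 1):
            x = math.exp(u0 + i * h)
            ys.append(gamma_pdf(x, a, b) * x)
        # Trapezoidal rule on the evenly spaced grid in u.
        fractions.append(h * (sum(ys) - 0.5 * (ys[0] + ys[-1])))
    return fractions


edges = [1, 10, 100, 1000, 1e4, 1e5]
fractions = expected_fraction_per_bin(2.0, 100.0, edges)
print([round(f, 4) for f in fractions])
```

Multiplying these fractions by the estimated number of cells of a construct gives expected per-bin counts that can be set against the observed histogram.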
Table 5
Command-line options for the plot_pdf and plot_all commands

Flag             Type     Default              Description

--distribution   String   gamma                Fluorescence distribution: “gamma” or “lognormal”
--construct      Integer  0                    Construct index to plot
--metadata_path  Path     Latest simulation    Path to MPRA data
--library_path   Path     forecast/data/gamma  Path to construct distributions (only for plot_all)
--legend_loc     String   right                Location for the legend: “left” or “right”

auto_bin*
--f_max          Float    1e5                  Maximum measurable fluorescence value
--bins           Integer  12                   Number of log-spaced bins used for sorting

custom_bin*
--upper_bounds   List     1e2 1e3 1e4          Upper fluorescence bounds of each bin. First upper fluorescence bound must be greater than 1 a.u.

*The auto_bin and custom_bin options are mutually exclusive and should be specified directly after the plot_pdf or plot_all
keyword. Flags specific to their operation are provided below each option in the table
4 Notes


1. When using FORECAST, it is essential that the “forecast” conda
environment is activated (see Subheading 2.2, step 4) and that all
commands are run from the root directory of the tool.


2. Inferring construct performance from the MPRA data is computationally
expensive. FORECAST parallelizes this step to 
Modeling Protein Complexes and Molecular Assemblies 
Using Computational Methods 


Romain Launay, Elin Teppa, Jeremy Esque, and Isabelle Andre 


Abstract 


Many biological molecules are assembled into supramolecular complexes that are necessary to perform 
functions in the cell. Better understanding and characterization of these molecular assemblies are thus 
essential to further elucidate molecular mechanisms and key protein-protein interactions that could be 
targeted to modulate the protein binding affinity or develop new binders. Experimental access to structural 
information on these supramolecular assemblies is often hampered by the size of these systems, which makes
their recombinant production and characterization rather difficult. Computational methods combining
structural data, molecular modeling techniques, and sequence coevolution information can thus offer 
a good alternative to gain access to the structural organization of protein complexes and assemblies. Herein, 
we present some computational methods to predict structural models of the protein partners, to search for 
interacting regions using coevolution information, and to build molecular assemblies. The approach is 
exemplified using a case study to model the succinate-quinone oxidoreductase heterocomplex. 


Key words Protein-protein interaction, PPI, Molecular assembly, Protein structure prediction, 
Protein-protein docking, Sequence coevolution 


1 Introduction


Protein-Protein Interactions (PPIs) play an important role in the 
functioning of living cells, including cell-to-cell interactions and 
metabolic and developmental control [1, 2]. Most cellular 
functions are mediated by the assembly of proteins as more than 
80% of the proteins operate in vivo in the form of homo- or hetero- 
oligomers [3] whose constituents assemble/disassemble dynami- 
cally [4]. Interaction between the proteins can be permanent or 
transient. While permanent interactions will form a stable protein 
complex, the transient interactions are rather involved in signaling
and regulation pathways or substrate/metabolite channeling [2, 5, 
6]. Better understanding these molecular assemblies and PPIs is 
thus of major importance to further elucidate molecular mechan- 
isms of cellular processes, engineer synthetic metabolic pathways 
for synthetic biology, or identify drug targets for biomedical 
applications [5]. 

PPIs can be investigated at different levels. In vivo, yeast
two-hybrid (Y2H, Y3H) techniques enable the detection of protein
interactions, while in vitro, a variety of methods can be used such 
as tandem affinity purification, affinity chromatography, coimmu- 
noprecipitation, protein arrays, protein fragment complementa- 
tion, phage display, and mass spectrometry [6-8] among others. 
At the structural level, investigation of PPIs has largely benefited 
from the growing number of protein-protein complexes solved in 
recent years using different biophysical techniques, such as X-ray 
crystallography, nuclear magnetic resonance spectroscopy, and 
cryo-electron microscopy [7]. To complete this arsenal of 
approaches, in silico molecular modeling based on a combination 
of template-based methods and docking approaches that can inte- 
grate experimental restraints (i.e., coevolution information) has 
also emerged as a powerful technique to investigate protein assem- 
blies, in particular when experimental data are lacking [3, 9]. 

In this chapter, we provide a brief introduction to computational
methods that allow the prediction of structural models of proteins,
the search for interacting regions using inter-protein coevolution
information, and the modeling and analysis of molecular assemblies. The 
use of some of these methods and tools is illustrated for the model- 
ing of the succinate-quinone oxidoreductase heterocomplex as a 
case study. 


2 Methods for Building a 3D Model of a Protein 


Predicting the three-dimensional structure of a protein based on its 
sequence is still an open problem in research. Protein structure 
prediction methods on the basis of protein sequences are based 
on two principles: (i) protein structure is more conserved across 
evolution than protein sequence, and (ii) there is a finite and 
relatively small (less than 10,000) number of unique protein folds 
in Nature [10]. 

Structure prediction methods are broadly classified into two 
categories: (a) template-based modeling (which uses one or several 
known structure(s) as template(s)) and (b) template-free modeling 
(which predicts a protein structure without using a significant 
template). There are also hybrid approaches that combine the two 
kinds of methods. 
New modeling methods or corrections to existing methods
continually emerge. There are several ways to keep up with the
best existing methods, identify the progress over time, and recognize
where future efforts may be most productively focused. One
way is to be aware of CASP results (the Critical Assessment of
Protein Structure Prediction, www.predictioncenter.org) conducted
every 2 years since 1994. Another way is to check the
Continuous Automated Model EvaluatiOn (CAMEO; www.cameo3d.org)
project that provides weekly follow-ups for three
different aspects of the prediction by web servers: (a) homology
modeling, (b) model quality estimation, and (c) contact prediction.

In recent years, machine learning approaches have contributed
tremendously to improving the accuracy of structural prediction,
even when no similar structure is known [11]. In particular, in the
recent CASP14, the AlphaFold2 method [11] outperformed most
methods by predicting structures with high accuracy.


2.1 Template-Based Methods

The methods referred to as template-based modeling include 
threading techniques and comparative modeling. Template-based 
modeling predicts the 3D structure of a query protein through the 
sequence alignment between the query and one or several proteins 
with known structures. When query and template sequences have 
been derived from a common ancestor, the method is referred to as 
homology modeling. However, proteins from different evolution- 
ary origins may still adopt a similar structure; in this case, threading 
methods are used to identify structural templates. 

Generally, the process of comparative modeling involves four 
steps: (a) template identification, (b) sequence alignment, 
(c) model building, and (d) model refinement and validation. If 
the model is not satisfactory, some or all of the steps can be 
repeated. As such, the success of homology modeling depends on 
the ability to identify the closely homologous templates based on 
sequence identity and to generate an accurate query-template 
alignment. The goal of the alignment is to map the
one-dimensional target sequence correctly onto the corresponding
three-dimensional positions of the template structure, ideally
with only substitutions and small insertions/deletions. Broadly
speaking, comparative modeling produces a good result if the
query-template alignment has a global sequence identity ≥30%.
As the sequence identity decreases, correct template identification
becomes more difficult and the alignment more prone to misaligned regions.
When the query-template sequence identity is between 20% and 30%, the pair falls in the
twilight zone, where the evolutionary relatedness of proteins becomes
uncertain [12, 13]. In this case, the threading technique may help 
to identify remote homology, leaving the ab initio method as the 
last alternative for protein structure prediction. 
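
As a rough illustration of the identity thresholds discussed above, the following Python sketch computes a percent identity from a pairwise query-template alignment and maps it to a modeling strategy. The identity is computed here over the columns where both sequences have a residue, which is only one of several possible definitions. The function names are hypothetical, and the branch below 20% identity is our assumption, as the text gives no explicit rule for that range.

```python
def percent_identity(aligned_query: str, aligned_template: str) -> float:
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    pairs = [
        (q, t)
        for q, t in zip(aligned_query.upper(), aligned_template.upper())
        if q != "-" and t != "-"
    ]
    if not pairs:
        return 0.0
    matches = sum(1 for q, t in pairs if q == t)
    return 100.0 * matches / len(pairs)


def suggest_strategy(identity: float) -> str:
    """Map a percent identity to the strategy suggested in the text."""
    if identity >= 30.0:
        return "comparative modeling"
    if identity >= 20.0:
        # Twilight zone: evolutionary relatedness is uncertain.
        return "twilight zone: try threading"
    # Assumption: below the twilight zone, fall back to threading or ab initio.
    return "threading or ab initio"


if __name__ == "__main__":
    identity = percent_identity("ACDE-F", "ACDQGF")
    print(identity, suggest_strategy(identity))
```

In practice, the alignment itself would come from the template search step, and the cutoffs should be read as rules of thumb rather than strict boundaries.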
2.2 Template-Free or Ab Initio Methods


2.3 Servers for Protein Structure Prediction and Related Databases


2.3.1 MODELLER via ModWeb and ModBase


For query proteins that have no structurally related protein in the 
PDB library, the structure must be built from scratch. This proce- 
dure is called ab initio modeling, de novo modeling, or template- 
free modeling. An ab initio method conducts an exhaustive search 
to identify the minimum energy conformation through optimiza- 
tion algorithms, such as Monte Carlo [14] or molecular dynamics 
[15], using knowledge-based scoring or physics-based energy func- 
tions. This procedure generates several putative conformations 
(also called decoys), and final models are selected from them. A 
successful ab initio modeling depends on three factors: 


(a) An accurate energy function that scores the native structure of 
a protein as being the most thermodynamically stable state, 
compared to all possible decoy structures 


(b) An efficient search method that can quickly identify the 
low-energy states through conformational search 


(c) A strategy that can select near-native models from a pool of 
decoy structures 
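
The three ingredients listed above (energy function, search method, and model selection) can be sketched with a toy Metropolis Monte Carlo search. This is only a minimal illustration under strong assumptions: the conformation is reduced to a few torsion angles, and `toy_energy` is a hypothetical function that simply favors angles near -60 degrees. Real ab initio programs use far more elaborate knowledge-based or physics-based functions and move sets.

```python
import math
import random


def toy_energy(angles):
    """Hypothetical energy: every torsion angle prefers -60 degrees."""
    return sum(1.0 - math.cos(math.radians(a + 60.0)) for a in angles)


def metropolis_search(n_angles=5, n_steps=2000, temperature=0.5, seed=0):
    """Sample conformations with the Metropolis criterion and collect decoys."""
    rng = random.Random(seed)
    current = [rng.uniform(-180.0, 180.0) for _ in range(n_angles)]
    current_e = toy_energy(current)
    decoys = [(current_e, list(current))]
    for _ in range(n_steps):
        trial = list(current)
        i = rng.randrange(n_angles)
        # Perturb one angle and wrap it back into [-180, 180).
        trial[i] = (trial[i] + rng.gauss(0.0, 20.0) + 180.0) % 360.0 - 180.0
        trial_e = toy_energy(trial)
        delta = trial_e - current_e
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if delta <= 0.0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = trial, trial_e
            decoys.append((current_e, list(current)))
    return decoys


def select_best(decoys, n=3):
    """Select the n lowest-energy decoys as final models."""
    return sorted(decoys, key=lambda d: d[0])[:n]


if __name__ == "__main__":
    for energy, angles in select_best(metropolis_search()):
        print(round(energy, 3), [round(a, 1) for a in angles])
```

Selecting by lowest energy is the simplest possible choice; clustering of decoys, as used by some of the servers described later, is a common alternative.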


Hereafter are presented some servers and databases used for protein 
structure prediction based on various strategies and using, in some 
cases, sequence coevolution information and artificial intelligence- 
derived methods. 


MODELLER is one of the most widespread comparative modeling 
methods for prediction of protein structures [16]. Models are 
obtained by satisfying spatial restraints derived from the query- 
template alignment. 

These restraints include: 


(a) Restraints on Cα-Cα and backbone N-O distances and on dihedral
angles


(b) Stereochemical restraints from the CHARMM-22 force field 


(c) Statistical preferences for dihedral angles and non-bonded 
inter-atomic distances derived from representative sets of 
known protein structures 


Additional restraints can optionally be added manually.
MODELLER is available free of charge only to academic nonprofit
institutions at https://salilab.org/modeller/.

Several resources based on MODELLER have been developed,
such as ModWeb and ModBase.

The ModWeb server (https://modbase.compbio.ucsf.edu/modweb/)
offers the possibility to use MODELLER online.

ModBase (http://salilab.org/modbase) is a database containing
fold assignments, sequence-structure alignments, models, and
model assessments for all sequences related to a known structure
[17]. The models are derived by ModPipe, an automated modeling
pipeline relying on the programs PSI-BLAST [18] and MODELLER.
ModBase also includes binding site predictions for small
ligands and a set of predicted interactions between pairs of modeled
sequences from the same genome.


2.3.2 PHYRE2


PHYRE2 (http://www.sbg.bio.ic.ac.uk/phyre2) is designed to
predict the three-dimensional structure of a protein from its
sequence [19]. The server uses a powerful strategy to detect remote
homology, combining PSI-BLAST alignment with hidden Markov
models (HMMs) via HHsearch for template detection. The primary
algorithmic strategy is composed of four steps. In the first step,
homologous sequences of the query are searched using HHblits.
The resulting alignment is used to predict secondary structure. In
the second step, HHsearch is performed against a database of
HMMs of proteins of known structure. The top-scoring alignments
are used to construct the protein model backbone. In the third
step, the loops are modeled, and in the last step, the side chains are
added to generate the final model. When the intensive mode is
used, an additional step runs an ab initio folding simulation called
“Poing” to model regions of the query protein with no detectable
homology to known structures.


2.3.3 I-TASSER

I-TASSER (Iterative Threading ASSEmbly Refinement) is a hierar- 
chical approach to protein structure and function predictions from 
their amino acid sequences [20]. I-TASSER is accessible via a web 
server (https://zhanglab.dcmb.med.umich.edu/I-TASSER) and a 
stand-alone package. Starting from an amino acid sequence, the 
algorithm tries to retrieve protein templates of similar fold from the 
Protein Data Bank (PDB: https://www.rcsb.org) using a meta- 
threading approach called LOMETS (https://zhanggroup.org/ 
LOMETS/). In the next step, the continuous fragments taken 
from the PDB templates are reassembled into full-length models. 
For cases where no appropriate template is identified, I-TASSER 
builds the whole structure by ab initio modeling. SPICKER iden- 
tifies the low free-energy states through clustering the simulation 
decoys (https://zhanggroup.org/SPICKER/). In the third step, a
second iteration of the fragment assembly simulation is performed
to remove steric clashes and refine the global topology of
the cluster centroids. The decoys generated are then clustered, and
the lowest energy structures are selected followed by an optimiza- 
tion of the hydrogen-bonding network. The final model is used to 
predict the protein biological function by matching the model with 
other known proteins using the enzyme classification 
(EC number), gene ontology vocabulary, and ligand binding 
sites. More recently, an I-TASSER-derived method called
D-I-TASSER has been developed for distance-guided protein structure
prediction (https://zhanggroup.org/D-I-TASSER/). This
method integrates inter-residue contacts predicted by a deep neural
network and has been reported to significantly enhance the accuracy
of models compared to I-TASSER.


2.3.4 trRosetta


trRosetta (transform-restrained Rosetta) is an algorithm for protein
structure prediction that uses a deep neural network to predict
inter-residue distances [9]. The algorithm is available as a
stand-alone version and as a web server
(https://yanglab.nankai.edu.cn/trRosetta/). The input is the amino
acid sequence or a multiple sequence alignment of the query protein.
A deep neural network is applied to predict the inter-residue
distance and orientation distributions. Some of the features used in
the convolutional layers of the network include amino acid
frequencies, entropies, and coevolutionary couplings.

Predicted inter-residue distances and orientations are used as
restraints to guide the Rosetta method to build three-dimensional
structure models based on direct energy minimization.

Recently, the algorithm was modified to include the option to
use templates. It is recommended to run the algorithm including
homologous templates, which are used to add restraints to Rosetta.


2.3.5 AlphaFold2 Method and Structural Database

Given a query sequence, AlphaFold2 [11] searches for related
sequences in three databases: UniRef90, BFD, and MGnify.
Then, potential templates are searched using HHsearch against 
the PDB70 database [21]. The input sequence, multiple sequence 
alignment, and template hits are used as inputs for the deep 
learning-based method that produces a variety of predictions 
including distances, torsions, and atom coordinates. Then, the 
predicted 3D model is relaxed using restrained gradient descent 
with the Amber ff99SB force field [22] integrated in 
OpenMM [23]. 

AlphaFold2 produces a per-residue confidence metric called
the predicted local distance difference test (pLDDT) on a scale
from 0 to 100, which estimates how well the prediction agrees with
an experimental structure based on the Cα atoms. A pLDDT >90 is
considered a highly accurate prediction; in addition to a good
backbone prediction, the side chains are often correctly oriented
(χ1 rotamers are 80% correct). Regions with pLDDT between
70 and 90 indicate a generally good backbone prediction. Regions 
with pLDDT between 50 and 70 are low confidence and should be 
treated with caution. Finally, regions with pLDDT <50 are proba- 
bly disordered. 
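
The pLDDT bands above can be applied directly to a predicted model. The following Python sketch reads the per-residue pLDDT from the B-factor field of the Cα atoms in a PDB-format file (AlphaFold2 coordinate files store pLDDT there) and assigns each residue to one of the four bands. The function names are hypothetical, the fixed-column parsing assumes standard PDB ATOM records, and the handling of the exact boundary values (70 and 50) is our assumption.

```python
def plddt_per_residue(pdb_text):
    """Return {(chain, residue_number): pLDDT} read from CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        # Atom name is in columns 13-16, B-factor in columns 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            residue_number = int(line[22:26])
            scores[(chain, residue_number)] = float(line[60:66])
    return scores


def confidence_band(plddt):
    """Map a pLDDT value to the interpretation given in the text."""
    if plddt > 90:
        return "highly accurate"
    if plddt >= 70:
        return "good backbone"
    if plddt >= 50:
        return "low confidence"
    return "probably disordered"


def summarize(pdb_text):
    """Count residues in each confidence band."""
    counts = {}
    for value in plddt_per_residue(pdb_text).values():
        band = confidence_band(value)
        counts[band] = counts.get(band, 0) + 1
    return counts
```

Calling `summarize(open("model.pdb").read())` on a downloaded model (the file name is a placeholder) gives a quick overview of how much of the chain falls in each band before inspecting the structure visually.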

In CASP14, AlphaFold2 was the top-ranked protein structure 
prediction method, producing predictions with high accuracy [24]. 

The source code of AlphaFold2 is available on GitHub
(https://github.com/deepmind/alphafold). It is also possible to
use AlphaFold2 via the ColabFold notebooks on Google Colab [25], a
free platform for protein folding that does not require any
installation or expensive hardware. Several ColabFold notebooks are
available on GitHub (https://github.com/sokrypton/ColabFold).

DeepMind and EMBL’s European Bioinformatics Institute
(EMBL-EBI) created the AlphaFold database (https://alphafold.ebi.ac.uk)
to provide open access to protein structure predictions
generated by the AlphaFold2 method. At the moment, the predic- 
tions cover almost the entire human proteome [26] and the 
proteomes of several other key organisms such as E. coli, fruit fly, 
mouse, and zebrafish, among others, totaling over 350,000 protein 
structures. The database provides three outputs from AlphaFold2: 
the three-dimensional coordinates, the per-residue confidence met- 
ric pLDDT, and the Predicted Aligned Error, which is necessary to 
assess confidence in the domain packing and large-scale topology of 
the protein. 


3 Protein-Protein Interaction Prediction Using Coevolution 


We refer to molecular coevolution when a change in one locus 
affects the selection pressure at another locus, and this change is 
reciprocal [27, 28]. In other words, when a mutation occurs in a 
particular position, another mutation may occur to compensate for 
the change or restore the protein function. As coevolving residues
tend to be close in the three-dimensional structure, coevolution has
been successfully applied to predict intra- and inter-protein residue
contacts [29-32]. When coevolution methods were applied at
whole-proteome scale and combined with structure modeling to
predict protein-protein interactions, the accuracy of interaction
prediction was higher than that of proteome-wide two-hybrid and
mass spectrometry screens [33]. A large panel of methods exists to
predict molecular coevolution; all of them use a multiple sequence 
alignment (MSA) as input. In general, a large number of diverse 
sequences are required to obtain reliable results. To predict inter- 
protein coevolution between two proteins A and B, the real input of 
the coevolution algorithm is the concatenated alignment; protein A 
and protein B for each organism must be properly paired (Fig. 1). 
Building the concatenated alignment is not straightforward, 
because each row of the MSA should contain a pair of interacting 
proteins out of two protein families. That means that it is desirable 
to concatenate orthologous proteins, as they are likely to perform 
an equivalent function, rather than other types of homologs. 
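
To make the idea of the concatenated alignment concrete, the sketch below pairs the sequences of two families by organism and computes a plain mutual information between one column of protein A and one column of protein B. This is an illustration only: the organism-keyed dictionaries and function names are hypothetical, pairing by organism name assumes one ortholog per organism, and the uncorrected mutual information used here is simpler than the corrected scores and DCA-type methods used by dedicated tools.

```python
import math
from collections import Counter


def concatenate_by_organism(msa_a, msa_b):
    """Build concatenated rows for organisms present in both alignments."""
    shared = sorted(set(msa_a) & set(msa_b))
    return [msa_a[organism] + msa_b[organism] for organism in shared]


def mutual_information(rows, i, j):
    """Mutual information (in bits) between columns i and j of the rows."""
    n = len(rows)
    count_i = Counter(row[i] for row in rows)
    count_j = Counter(row[j] for row in rows)
    count_ij = Counter((row[i], row[j]) for row in rows)
    mi = 0.0
    for (a, b), c in count_ij.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (count_i[a] * count_j[b]))
    return mi


if __name__ == "__main__":
    # Toy aligned sequences (made up), keyed by organism.
    msa_a = {"org1": "AK", "org2": "AE", "org3": "AK", "org4": "AE"}
    msa_b = {"org1": "DG", "org2": "KG", "org3": "DG", "org4": "KG", "org5": "DG"}
    rows = concatenate_by_organism(msa_a, msa_b)
    # Column 1 (protein A) and column 2 (protein B) change together.
    print(mutual_information(rows, 1, 2))
```

With only a handful of sequences, such scores are dominated by noise, which is why a large and diverse alignment is required.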

The I-COMS web server (http://i-coms.leloir.org.ar) computes
inter-protein contact predictions using four different
covariation methods [34]. The server gives the option to provide
the concatenated alignment or to build it automatically. The four
covariation methods are corrected mutual information,
mfDCA, PSICOV, and CCMpred. Intra- and inter-protein results
are provided in an interactive visualization allowing the comparison
between methods as well as the concordance between results.
Covariation positions can be calculated for up to five proteins.

Fig. 1 Inter-protein coevolution. In the concatenated alignment between two interacting proteins A and B, two
positions coevolve (indicated with an arrow) to maintain favorable interactions between physically interacting
amino acid residues (indicated as *) in the three-dimensional structure


4 Protein Assembly Prediction and Analysis 


4.1 Protein-Protein Docking: Principles and Methods


When the structural information of different protein partners is 
available through experimental data or modeling, the docking 
approach is used as a standard method to predict the potential 
interactions. The aim of docking is to find the best-matched 3D
structure of the protein complex among several protein models. To
do so, a fast search algorithm is used to sample all possible spatial
conformations, and a scoring function is needed to rank the
solutions. Given the large number of possible positions and
orientations, spatial search algorithms in protein-protein docking
can be divided into three main categories:
(a) exhaustive global search, including fast Fourier transform
(FFT)-based search [35, 36] and spherical Fourier
transform-based search [37-39], (b) randomized search using
Monte Carlo [40, 41], and (c) local shape feature matching,
including geometric hashing [40]. It is important to note that all
FFT-based approaches perform rigid-body docking because the 
related grid cannot be updated, unlike randomized search 
algorithms. 
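
The principle of the FFT-based exhaustive search can be illustrated on a small two-dimensional grid. In this sketch, which is a toy under stated assumptions and not the scoring scheme of any particular docking program, the receptor grid marks favorable cells with 1, the ligand grid marks occupied cells with 1, and the score of a translation is the number of overlapping cells. The correlation theorem gives the scores of all cyclic translations at once. The radix-2 FFT is written in pure Python for self-containment, so the grid must be square with a power-of-two size; all function names are hypothetical.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft2(grid, inverse=False):
    """2D FFT: transform rows, then columns."""
    rows = [fft(row, inverse) for row in grid]
    cols = [fft(list(col), inverse) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def correlation_scores(receptor, ligand):
    """scores[i][j] = overlap when the ligand is translated by (i, j), cyclically."""
    n = len(receptor)
    fr = fft2(receptor)
    fl = fft2(ligand)
    product = [[fr[i][j] * fl[i][j].conjugate() for j in range(n)] for i in range(n)]
    inverse = fft2(product, inverse=True)
    # The unnormalized inverse transform carries a factor n * n.
    return [[value.real / (n * n) for value in row] for row in inverse]


def best_translation(scores):
    """Return the (row, column) translation with the highest score."""
    best = max((v, i, j) for i, row in enumerate(scores) for j, v in enumerate(row))
    return best[1], best[2]


if __name__ == "__main__":
    n = 8
    receptor = [[0] * n for _ in range(n)]
    ligand = [[0] * n for _ in range(n)]
    for i, j in [(2, 3), (2, 4), (3, 3)]:
        receptor[i][j] = 1
    for i, j in [(0, 0), (0, 1), (1, 0)]:
        ligand[i][j] = 1
    print(best_translation(correlation_scores(receptor, ligand)))
```

Because the grids are fixed during the transform, the partners are necessarily treated as rigid bodies; rotations are handled by repeating the translational scan for each sampled orientation of the ligand.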

Protein-protein docking methods typically generate thousands
of potential solutions for a particular complex. To discriminate
near-native solutions, a scoring function is needed, and its
development is still challenging. These scoring functions can be
divided into several categories, which are sometimes combined:
(a) physics-based scoring functions capturing the determinants
related to the stability of protein-protein complexes, e.g., shape
complementarity, van der Waals, electrostatics, and desolvation
potential [41-47],
(b) knowledge-based functions taking advantage of the information
from available structures [48-51], (c) scoring functions combining
physical terms with knowledge-based terms [52-55],
(d) evolutionary scoring functions based on protein sequence
evolution [56, 57], and (e) consensus-based scoring functions
seeking to identify solutions with high-occurrence features,
independently of any physics-based or evolutionary evaluation, such
as conservation of interface contacts [58-62]. Along the same lines
as the CASP contest for protein structure prediction, the CAPRI
competition allows a blind assessment of the most recent methods,
offering an updated view of progress in the field [63-65].


4.2 ZDOCK


ZDOCK is a protein-protein docking method available through an
online web server (https://zdock.umassmed.edu/) [66]. It uses
the fast Fourier transform algorithm to enable an efficient docking
search. This user-friendly server predicts complexes in three
steps. The first step is to provide two input structures
(by PDB code or PDB file) and choose the ZDOCK version. The
second step is the selection of blocking or contacting residues for
each protein submitted. The last step is the result analysis and
visualization, including the top ten docking models.


4.3 InterEvDock3

InterEvDock3 (https://bioserv.rpbs.univ-paris-diderot.fr/services/InterEvDock3/)
is a server designed for predicting pairwise protein assemblies,
based on sequence or on structure, and
possibly combined with coevolution data [67]. Three protocols are
implemented to make the best use of the available information.

The first method is template-based docking; it uses the sequences
to search for protein assemblies with already known structures.
The template-based docking protocol needs two or more sequences
and uses HHsearch to search a list of interacting proteins for
homologs whose structure is available in complex with partners.
The structural assembly is built by threading
for the main parts, and the missing parts are built with the
DaReUS-Loop program [68].

The two other methods perform free docking using the 
FRODOCK software. Then, generated models are ranked accord- 
ing to the coevolution information given by the user or computed 
by the server. 


5 Case Study: Modeling the Succinate-Quinone Oxidoreductase Heterocomplex 


We propose to build a structural model of the supramolecular
complex succinate-quinone oxidoreductase (SQR). SQR is a key
enzyme in the Krebs cycle, oxidizing succinate to fumarate and
reducing quinone to quinol, acting as a link between the Krebs
cycle and the respiratory chain. Escherichia coli SQR has four
subunits: two hydrophilic subunits exposed to the cytoplasm (SdhA
and SdhB), which interact with two hydrophobic membrane-intrinsic
subunits (SdhC and SdhD) [69]. Interestingly, SdhA and
SdhB have already been shown to coevolve. This information
made it possible to predict the proper interacting interface [29-32]
compared to the crystallographic protein structure of E. coli SQR
[70, 71] (PDB codes: 1NEK, 2WDQ).

For pedagogical purposes, we provide step-by-step instructions
to generate the structural models of the heterotetramer subunits
and their assembly (Fig. 2). First, we shall build a structural model
for all subunits (SdhA, SdhB, SdhC, and SdhD) using either the
AlphaFold2 method without template or I-TASSER without using
close templates. This choice will mimic cases where no
crystallographic information is available. Second, we will use
inter-protein coevolution detection to predict residue contacts
between the subunits. The dataset for coevolution comes from the
available data reported in reference [30] and is provided in
supplementary information (SI1). Third, the predicted residue
contacts will be used to guide the protein-protein docking. Fourth,
a docking will be carried out between the dimers SdhA-SdhB and
SdhC-SdhD without using coevolution information.


5.1 Building a 3D Model Using AlphaFold2: SQR Subunits, SdhA, and SdhC

To avoid setting up AlphaFold2 on your local computer, we will use
an online version to build the 3D models of SdhA and SdhC. The
following steps are the same for SdhA (UniProt ID: P0AC41) and
SdhC (UniProt ID: P69054):


1. Download the amino acid sequence of the target in FASTA 
format from UniProt. 


2. Go to ColabFold repository (https://github.com/sokrypton/ 
ColabFold). 


3. Choose the Notebook AlphaFold2 (from DeepMind). 


4. Execute the first two cells by clicking the play button. This will
install the required programs in the cloud, not on your
computer.


5. Wait until the task is completed; a green tick mark will appear to
the left of the play button. You can also follow the progress
of each task in the progress bar (Fig. 3a).


6. Paste the protein sequence without the FASTA header in the 
text box. 


7. Select Runtime -> Run After in the toolbar at the top of the screen.
8. Unzip the results file, which is downloaded automatically.
9. It’s done! Now, we are ready to analyze the results.

To make sure that you can reproduce the result, it is recommended
to save a copy of the notebook on your computer. You can
find several options to save the notebook in the File menu in the
top bar.
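
Step 6 above asks for the sequence without its FASTA header. A minimal way to obtain it from the file downloaded in step 1 is sketched below; the function name is hypothetical, and only the first record of the file is returned.

```python
def sequence_from_fasta(fasta_text):
    """Return the first sequence of a FASTA text as one string, without header."""
    residues = []
    seen_header = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if seen_header:
                break  # a second record starts here: stop after the first one
            seen_header = True
            continue
        residues.append(line)
    return "".join(residues)


if __name__ == "__main__":
    # Made-up record, for illustration only.
    example = ">sp|X00000|EXAMPLE hypothetical protein\nMKTAY\nIAKQR\n"
    print(sequence_from_fasta(example))
```

The returned string can be pasted directly into the text box of the notebook.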

Fig. 2 Strategy to model the heterocomplex succinate-quinone oxidoreductase (SQR). The complex model was
built as follows. First, a structural model was built for all subunits (SdhA, SdhB, SdhC, and SdhD) using
either the AlphaFold2 method without template or I-TASSER without using close templates. Second,
inter-protein coevolution detection was used to predict residue contacts between the subunits. The dataset for
coevolution comes from the available data reported in reference [32] and is provided in supplementary
information (SI1). The inter-protein contact prediction was carried out using I-COMS. Third, the two subunits
were docked using InterEvDock3 with coevolution information, and in the fourth step a docking was carried
out between the dimers using ZDOCK without coevolution information


To analyze the results, we will visualize two parameters: (a) the
number of sequences and gaps for contact prediction (Fig. 3b) and
(b) the AlphaFold per-residue confidence score (pLDDT), which is
found in the B-factor fields of the coordinate files (Fig. 3c). Both
the sequence information and the per-residue pLDDT score gave, on
average, good confidence in the quality of the 3D models
(SdhA and SdhC). To confirm this result, both 3D models were
compared with the corresponding X-ray structures (PDB code:
1NEK, chains A and C). Using the TM-align server
(https://zhanggroup.org/TM-align/), structural alignments between
models and solved structures gave RMSD values of 0.73 Å and
1.33 Å for SdhA and SdhC, respectively. It is worth noting that
these RMSD values correspond to aligned residues only; they
can increase when the whole structure is considered, including the
loop/coil/disordered regions highlighted in Fig. 4.

Fig. 3 Building a 3D model of SdhC from E. coli using AlphaFold2. Following the ColabFold notebook running
process (a). Coverage of the multiple sequence alignment used by AlphaFold2 (b). Structural model colored by
pLDDT (c). The AlphaFold2 method predicts a bundle of transmembrane helices and a disordered/coil region at
the N-terminus. The confidence is low in this region because of the lack of sequence information there
(N-terminal region in b)


5.2 Building a 3D Model Using I-TASSER: SQR Subunits, SdhB, and SdhD


To avoid installing and setting up programs on your computer, we
will use the widely used I-TASSER web server to build the 3D
models of SdhB and SdhD.


1. Register (https://zhanggroup.org/I-TASSER/registration.html).


2. Download the amino acid sequence of the target in FASTA
format from UniProt (UniProt ID: P07014 and P0AC44 for
SdhB and SdhD, respectively).


3. Go to I-TASSER webserver (https://zhanggroup.org/I-TASSER/). 

4. Paste the protein sequence in FASTA format in the text box 
(Fig. 5a). 

5. Type 60% to exclude homologous templates in the Option II 
section. 

6. Identify yourself with your email and password.

7. Click on the “Run I-TASSER” box. 


8. Wait for results sent by email. 

Fig. 4 Structural comparison between X-ray structure (1NEK) and 3D models from AlphaFold2. SdhA (a) and
SdhC (b) structures are shown in cartoon and colored as in Fig. 2. X-ray structures are displayed in transparent
gray cartoon representation. Red squares highlight the main regions where AlphaFold2 differs from the X-ray
structure


5.3 Modeling SdhA-SdhB and SdhC-SdhD Using Protein-Protein Docking and Coevolution Information


To analyze the results, we will visualize two parameters: (a) the
threading templates used by I-TASSER and the alignment quality
against the target sequence (Norm Z-score) (Fig. 5b) and (b) the
I-TASSER score (C-score), which gives the confidence of each model
based on the significance of threading template alignments and the
convergence parameters of the structure assembly simulations
(Fig. 5c). This score ranges from -5 to 2, with higher
values (close to 2) indicating a higher confidence in the 3D
model and vice versa. Both the templates and the C-scores (1.23 and
0.53 for SdhB and SdhD, respectively) gave good confidence in
the quality of the 3D models. Indeed, the best C-scores were obtained
using chains B and C of 1YQ3 as templates for SdhB and SdhD,
respectively. Although the sequences from 1YQ3 share only ~50% and
20% identity with SdhB and SdhD, respectively, the selected template
corresponds to the same functional complex from another organism
(Gallus gallus). To confirm this result, both 3D models were
compared with the corresponding X-ray structures (PDB code:
1NEK, chains B and D). Using the TM-align server, structural
alignments between models and solved structures gave RMSD
values of 2.01 Å and 2.19 Å for SdhB and SdhD, respectively.
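
For reference, the RMSD values quoted above follow the usual definition: the square root of the mean squared distance between paired atoms. The sketch below computes it for two lists of Cα coordinates, assuming that the residues have already been paired and the structures already superimposed (a server such as TM-align performs both steps before reporting the value). The function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) coordinates.

    Assumes the coordinates are already paired and superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    reference = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
    print(rmsd(model, reference))
```

Because the sum runs only over paired residues, unaligned loops or disordered regions do not contribute, which is why the value can increase when the whole structure is considered.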


Among the six possible protein pairs composing the heterotetramer,
we focused on the prediction of SdhA-SdhB and SdhC-SdhD,
the first pair corresponding to the cytosolic subunits and the second
one to the membrane domains. We will use inter-protein coevolution
to predict contacts within each of these two subunit pairs using
the I-COMS server. The input will be the alignments taken from a
previously published and publicly available dataset and provided in
supplementary information (SI1).

Fig. 5 Building a 3D model of SdhD from E. coli using I-TASSER. Following the submission process
described between steps 4 and 7 (a). Top ten threading templates (b). Best 3D model out of the top five
final models (c)


1. Download the alignments from SI1. 


2. Go to the I-COMS server (http://i-coms.leloir.org.ar/index. 
php). 


3. Select the option “Upload your own alignments.” 

4. Optionally, you can describe the uploaded dataset. 

5. Upload the two alignments using the “Browse...” button. 

6. Click on “Upload and submit.” 

7. Choose the method for coevolution: plmDCA. 

8. Optionally you can indicate the job description and your email 
address. 


Results include information about the alignment used, such as 
the number of sequences and clusters. If the number of clusters is 
low (<400), it means that there is little diversity in the MSA and the 
results should be interpreted with caution. Results are shown in a 
circos representation of the covariation scores of each of the 
selected methods, and protein pairs are displayed (Fig. 6). 

Fig. 6 Docking of subunits using coevolution information. Top five inter-protein coevolution results from
I-COMS server. The inner circle represents the sequence positions in boxes colored according to the sequence
they belong to (SdhA or SdhB). The correlated mutation scores are represented as lines between positions in
the center of the circle. Given as example, the coevolving positions K38 and R52 from SdhA and SdhB,
respectively, are indicated (a). Top five inter-protein coevolving positions are shown in the modeled subunits;
the Cα of coevolving positions are shown in sphere representation (b). Analogous results are given for
subunits SdhC and SdhD, the top five coevolution results (c) and the same coevolving pairs mapped on the
models (d)


To visualize the inter-protein results:
1. Choose the pair of proteins (SdhA vs SdhB) or (SdhC vs SdhD).
2. Select the method.
3. Click on “Draw Circos.”
4. Click on “Inter-protein” links.
5. You can select the number of edges to visualize.
6. Download the covariation raw data; it will be used in the next steps.


5.4 Modeling the Succinate-Quinone Oxidoreductase Heterocomplex Using Protein-Protein Docking and Restraints


Protein docking of SdhA-SdhB and SdhC-SdhD will be performed
using the InterEvDock3 server and the residue contact predictions
from I-COMS described previously. The inputs will be the PDB
files of the two partners to dock and a list of residue pair contacts.
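
The list of residue pair contacts can be prepared from the covariation raw data by keeping the highest-scoring inter-protein pairs (the top 100 are used in the steps below). The sketch assumes a simple whitespace-separated layout with one pair per line (position in partner A, position in partner B, score); this layout is hypothetical, so the parsing should be adapted to the actual file downloaded from the server.

```python
def top_pairs(raw_text, n=100):
    """Return the n highest-scoring (pos_a, pos_b, score) tuples.

    Assumed (hypothetical) input layout: 'pos_a pos_b score' per line;
    lines starting with '#' and lines with fewer than three fields are skipped.
    """
    pairs = []
    for line in raw_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0].startswith("#"):
            continue
        pairs.append((int(fields[0]), int(fields[1]), float(fields[2])))
    pairs.sort(key=lambda pair: pair[2], reverse=True)
    return pairs[:n]


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    raw = "# pos_a pos_b score\n38 52 0.91\n10 77 0.12\n45 60 0.55\n"
    print(top_pairs(raw, n=2))
```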


7. Go to the InterEvDock3 server (https://mobyle.rpbs.univ-paris-diderot.fr/cgi-bin/portal.py#forms::InterEvDock3).


8. Upload Partner A: click on “Browse...” to select the PDB file.


9. Upload Partner B: click on “Browse...” to select the PDB file.


10. Click on “Advanced Options.” 
11. Go to “Use of co-evolution or deep-learning maps.” 


12. Upload the coevolution map (Top 100) from I-COMS given
in SI2.


13. Select “Yes” in “Minimize the output models using 
gromacs.” 


14. Click on run. 


The InterEvDock3 web portal makes it possible to follow the job
progress at any time without any specific link. As a precaution, the
https link associated with the job can be stored locally.

The main InterEvDock3 output provides two kinds of rankings
limited to the top ten poses: (a) based on the number of structural
contacts matching the predicted coevolution pairs and (b) based on
the scoring function related to the sum of the best predicted
coevolution pairs.

In this study, the best docking poses for both heterodimers are
selected from the second type of ranking, which favors the
most probable pairs according to their coevolution score. The
resulting models of the heterodimers are provided in SI3.
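
The first kind of ranking, based on the number of structural contacts matching the predicted coevolution pairs, can be illustrated as follows. This is a simplified sketch and not the InterEvDock3 implementation: each pose is represented by the Cα coordinates of the two partners keyed by residue number, and a predicted pair is counted as satisfied when the Cα-Cα distance is within a cutoff. The 8 Å cutoff and all names are assumptions made for the example.

```python
import math


def count_satisfied(partner_a, partner_b, pairs, cutoff=8.0):
    """Count predicted (res_a, res_b) pairs that are in contact in one pose."""
    satisfied = 0
    for res_a, res_b in pairs:
        if res_a in partner_a and res_b in partner_b:
            if math.dist(partner_a[res_a], partner_b[res_b]) <= cutoff:
                satisfied += 1
    return satisfied


def rank_poses(poses, pairs, cutoff=8.0):
    """Rank poses by decreasing number of satisfied pairs (ties by name)."""
    scored = [
        (count_satisfied(partner_a, partner_b, pairs, cutoff), name)
        for name, (partner_a, partner_b) in poses.items()
    ]
    return sorted(scored, key=lambda item: (-item[0], item[1]))


if __name__ == "__main__":
    # Made-up coordinates (in angstroms), for illustration only.
    partner_a = {38: (0.0, 0.0, 0.0), 40: (10.0, 0.0, 0.0)}
    poses = {
        "pose1": (partner_a, {52: (5.0, 0.0, 0.0), 60: (12.0, 0.0, 0.0)}),
        "pose2": (partner_a, {52: (30.0, 0.0, 0.0), 60: (12.0, 0.0, 0.0)}),
    }
    print(rank_poses(poses, [(38, 52), (40, 60)]))
```

The second kind of ranking weights the pairs by their coevolution score instead of counting them equally.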


As there is not enough information when merging the concatenated
MSAs from SdhA, SdhB, SdhC, and SdhD, coevolution cannot be
used to predict residue contacts between the two heterodimers.
Therefore, the docking between the predicted partners will be done
using “classical” docking. Free docking and docking with restraints
will be performed using the ZDOCK server. To avoid clashes and
improve docking prediction, the N-terminal disordered regions of
SdhC and SdhD are removed, corresponding to the first 13 and the
first 10 residues, respectively.
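
Removing the N-terminal residues can be done in any molecular editor; a scripted alternative is sketched below. It drops the first n residues of a given chain from a PDB-format text, identifying residues by their order of appearance in the file rather than by their numbering. The function name is hypothetical, and the fixed-column parsing assumes standard ATOM/HETATM records.

```python
def trim_n_terminus(pdb_text, chain, n_residues):
    """Remove the first n_residues residues of one chain from PDB text."""
    dropped = []  # residue identifiers (number + insertion code) to remove
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")) and line[21] == chain:
            residue_id = line[22:27]
            # Collect new residue identifiers until n_residues are marked.
            if residue_id not in dropped and len(dropped) < n_residues:
                dropped.append(residue_id)
            if residue_id in dropped:
                continue
        kept.append(line)
    return "\n".join(kept)
```

For the case described above, this would be called with `n_residues=13` for the SdhC chain and `n_residues=10` for the SdhD chain, using the chain identifiers present in the docked heterodimer file.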

1. Go to https://zdock.umassmed.edu/.


2. Choose “PDB file” in the scrolling list close to the “Input Protein
1” keyword.


3. Click on “Browse ...” to select the PDB file corresponding to
SdhA-SdhB.


4. Repeat steps 2 and 3 for Input Protein 2.


5. Fill in the form “Enter your email.”


6. Optionally, for free docking, check the box close to Skip residue
selection.


7. Click on the “Submit” button.


8. If Skip residue selection was not checked, select interactively the
residues belonging to the binding site for guiding docking.


9. Click on the “Submit” button.
10. Wait for the results sent by email.
11. Download the top ten predictions.


12. Select the first docking pose.


This particular case seems to be difficult for docking
prediction. Indeed, free docking does not provide a good solution
compared to the X-ray structure. To get a correct assembly, lists of
17 and 19 residues from SdhB and SdhC-SdhD, respectively (given in
SI2), had to be provided to guide the docking. The binding residues
at the interface can be selected on a distance threshold criterion,
here 3.2 Å between heavy atoms in the X-ray structure. Having this
kind of information helps to obtain better predictions, as shown in
Fig. 7. Superposition of the modeled heterotetramer onto the X-ray
structure (PDB code: 1NEK) showed an RMSD of ~0.73 Å based on
the TM-align server, indicating a very good fit. The coordinate file of
the final model is provided in SI3.


6 Conclusions

Overall, this study shows that protein complex prediction is not a
trivial question. The first crucial task is to obtain the structure of
each protein partner. Depending on the available data, different
approaches can be applied, with a new methodology, AlphaFold2,
outperforming the others. Part of the success in the assembly
construction depends first on the quality of the 3D structural
model of each partner. Therefore, an assessment such as pLDDT is an
important step at this stage. Then, protein-protein interactions can
be predicted with reasonable confidence when diverse information,
such as coevolution prediction or experimental results, is available
to guide toward the most probable assembly. In this study, both
cases are exemplified. The two heterodimers were quite well predicted
using coevolution information thanks to the diversity of the data.
However, construction of the heterotetramer assembly was quite
challenging because the interactions with the membrane are not
taken into account in the docking procedure. To circumvent this
limitation, a set of amino acid residues from the protein interface
identified from experimental data was used to guide the construction
of the heterotetramer assembly.

Fig. 7 Superposition of the modeled SQR heterotetramer onto the reference
structure. Each modeled subunit (SdhA, SdhB, SdhC, and SdhD) is shown in
cartoon representation and is colored according to the corresponding label. The
heterotetramer is obtained from the docking of the two main units SdhA-SdhB
and SdhC-SdhD. The reference corresponds to the X-ray structure (PDB code:
1NEK), which is shown in white cartoon representation for clarity
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Chapter 5 


From Genome Mining to Protein Engineering: A Structural 
Bioinformatics Route 


Derek J. Smith 


Abstract 


This chapter outlines applications in genome mining, along with computational methods to predict protein 
structure and protein-ligand docking. It offers a simple computational route to rapidly identify proteins of 
interest from genomic and proteomic data, to accurately predict their three-dimensional structures, and to 
dock small molecules to their binding pockets. It also presents strategies to improve the biophysical 
properties of these proteins, depending on the needs of the experimental researcher. 


Key words Genome mining, Protein structure prediction, Structural bioinformatics, Small molecule 
docking, Protein engineering, Directed evolution 


1 Introduction 


The recent rise of rapid genome sequencing and annotation has 
produced a wealth of data, almost dazzling in its scope, for 
experimental and theoretical researchers. Genomes can be parsed to 
identify sequences with potential roles in pharmaceuticals or the 
chemical industry. In addition, advances in protein structure 
prediction methods, particularly in the areas of artificial 
intelligence and deep learning, have produced models of almost 
atomic-level accuracy. Coupled with this, protein-ligand scoring 
functions and rapid docking give the researcher tools for 
interrogating increasingly accurate predictive models of the 
interaction of proteins with potential drugs or fine chemical 
substrates. 


1.1 Genome Mining: Finding Your Needle in a Haystack 


Advances in modern gene sequencing technologies have enabled 
the production of a vast amount of biological sequence data in a 
very short time. The researcher keen to understand the origin of his 
or her metabolite of interest is often now able to decode the 
biosynthetic pathway through genome mining, given a fully 
sequenced genome of the organism. Genome mining, as a subset of 
data mining, involves the interrogation of genomic sequences, 
based on knowledge of enzymatic reactions and conserved protein 
sequences. It can be used to identify novel sequences that encode 
specific enzymatic activity towards metabolites. It enables an 
in-depth understanding of metabolite biosynthesis, all the way 
from single enzyme reactions to full biosynthetic pathways and 
their means of regulation [1]. 

Many tools are available for the interrogation of genome, 
transcriptome, and proteome sequences, but here we will focus 
on three standard tools looking at different levels of data. The 
most common sequence search tool is the Basic Local Alignment 
Search Tool (BLAST) software [2]. BLAST allows rapid compari- 
son between biological sequences (nucleotide and protein) and can 
identify both long and short regions of similarity between a query 
sequence and searchable sequence databases. The BLAST software 
is actually a suite of programs for dealing with sequence comparison 
for protein-protein searching (blastp) and nucleotide-nucleotide 
searching (blastn). For more sensitive searching, blastx translates a 
nucleotide query to protein, tblastn translates a nucleotide database 
to protein, and tblastx translates both nucleotide query and data- 
base to protein, all of these for protein-protein searching. It is ideal 
for the identification of single sequences. 

A more sophisticated search program is HMMER (pronounced 
“hammer”) [3]. HMMER is also a suite of programs, but instead of 
using individual sequences for search queries, it uses sequence 
profiles. Evolutionarily related sequences are aligned and used to 
encode profiles along the sequence in the form of a hidden Markov 
model (HMM). These profile models represent the likelihood of 
insertions and deletions along the sequence, as well as conserved 
sequence positions and blocks, and are peculiar to individual 
sequence families. This HMM method enables a very sensitive 
scoring and identification of entire sequence families in a genome 
and has been used to identify sequences with novel functions in 
both genomes [4] and transcriptomes [5]. 
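
To make the idea of a sequence profile concrete, the following Python sketch builds per-column log-odds scores from a small gap-free alignment and uses them to score candidate sequences. It is a deliberate simplification and not HMMER's algorithm: it models only conserved (match) positions, omits the insertion and deletion states of a true profile HMM, and assumes a uniform background distribution. The toy family, the pseudocount, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Per-column log2-odds scores (vs. a uniform background) from
    gap-free aligned sequences of equal length."""
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, sequence):
    """Sum the column scores for a sequence of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(profile, sequence))


# Toy family, invented for illustration only.
family = ["HGDSL", "HGDTL", "HGESL", "HGDSI"]
profile = build_profile(family)

print(score(profile, "HGDSL") > score(profile, "AAAAA"))  # True
```

A sequence that matches the conserved columns scores well above an unrelated one, which is the property that lets a profile pick out entire families rather than single close homologs.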

As well as locating single sequences and families, one further 
level of searching that is relevant to this discussion is the identifica- 
tion of biosynthetic gene clusters (BGCs). For the production of 
natural products and other secondary metabolites, organisms tend 
to organize their biosynthesis genes into clusters, with the most 
well-known being the observance of operons (long lengths of DNA 
encoding entire biosynthetic pathways) in bacteria [6]. This is also 
seen in eukaryotes such as fungi [7], and even in plants, where 
genes for specific biosynthetic pathways are seen to be co-localized 
(e.g., terpene synthases and P450s, phenylpropanoids, alkaloids, 
and plant defense compounds [8]). One commonly used program 
for identifying BGCs is the antibiotics and Secondary Metabolite 
Analysis SHell (antiSMASH) [9]. antiSMASH is available as both a 
web server and as a downloadable stand-alone program. It can be 
used to identify, annotate, and compare potential BGCs from 
bacteria, and versions are also available for fungal and plant genomes. 
An uploaded genome first undergoes gene prediction, followed by 
gene cluster identification. It also can predict the chemical struc- 
tures of potential products and perform protein domain analyses 
and comparisons with related clusters from other organisms. 


Having obtained the sequence(s) of interest, the next steps are 
cloning and expression to identify properties and activity. One 
aspect that is often neglected in a molecular biology environment 
is obtaining the structure (or at least a reasonable model) of the 
expressed protein. This is key for understanding activity: the 
function of the protein is itself a function of the three-dimensional 
structure of that protein. For example, an enzyme works by folding 
in such a way as to bring catalytic amino acids together in space, or 
by binding a reactive cofactor molecule, together with a chemical 
environment (polar, charged, hydrophobic, or a combination of 
these) that permits the binding of a chemical substrate, a reaction, 
and then egress of the products. 

The problem is that it is not easy to predict the structures of 
proteins de novo. Levinthal’s paradox points out that an 
astronomical amount of time would be required for a protein to fold 
by randomly sampling all of its possible conformations. As folding 
actually occurs much faster than this, the protein must fold not by 
exhaustive search but through intermediate states [10]. 
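
The scale of the paradox is easy to check with a short calculation. The numbers below are illustrative assumptions rather than values from this chapter: a 100-residue chain, three accessible conformations per residue, and a sampling rate of 10^13 conformations per second.

```python
# Illustrative assumptions (not taken from the chapter):
residues = 100              # chain length
states_per_residue = 3      # accessible conformations per residue
samples_per_second = 1e13   # conformations tried per second

conformations = states_per_residue ** residues
seconds = conformations / samples_per_second
years = seconds / (3600 * 24 * 365)

print(f"{conformations:.2e} conformations")
print(f"{years:.2e} years for an exhaustive search")
```

Under these assumptions an exhaustive search would take on the order of 10^27 years, whereas real proteins typically fold in milliseconds to seconds.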

The most common ways of experimentally determining protein 
structures are x-ray crystallography and nuclear magnetic resonance 
(NMR). These standard tools are excellent ways of obtaining struc- 
tures but often involve a lot of experimental effort. For x-ray 
crystallography, these include growing regular crystals that diffract, 
as well as solving the phase problem through molecular replace- 
ment or heavy atom isomorphous replacement techniques. In the 
case of NMR, there is often the requirement of expensive enriched 
isotopes, which must be incorporated into the expressed protein. 

Protein structure prediction through computational methods 
is a faster route to get to a model protein structure. Historically, 
protein structure prediction has been dominated by what is often 
referred to as homology or comparative modeling. Chothia and Lesk 
[11] identified that evolutionarily related sequences are likely to 
have similar structures, and the closer the sequences, the closer 
the structures, especially in the conserved core regions. This 
became an early basis for comparative modeling of protein 
sequences, which can be outlined as follows: 


A. Starting with a target sequence of interest, the Protein Data 
Bank (PDB) of deposited 3D structures [12] is searched using 
BLAST for template structures related to the target sequence. 
If no related structure is found, a threading algorithm may be 
used to try to identify more distantly related structures [13]. 


B. These structures are aligned to the target sequence to produce 
a sequence-structure alignment. This allows the identification 
of structurally conserved regions (SCRs) of main chain backbone 
and structurally variable regions (SVRs), which correspond 
roughly to loop regions. 


C. The model is constructed by copying the SCRs, along with 
conserved sidechains, to form the basis of the model. This is 
known as fragment matching [14]. Missing loops are either 
added from loop libraries or constructed de novo, and then 
sidechains are added, usually from rotamer libraries of 
observed sidechain conformations. Another method of 
construction, known as segment matching [15], relies on copying 
short lengths of conserved sequence-structure regions across 
the whole model. A third method, the satisfaction of spatial 
restraints, identifies geometric features of the template 
structures and converts them to probability density functions. 
These are then optimized across the whole model to generate 
a 3D representation, in a manner similar to the protocol used 
in NMR protein structure determination [16]. 


D. The model is then refined, usually by energy minimization or 
molecular dynamics, and assessed by stereochemical fidelity to 
known protein structures, as well as the use of knowledge-based 
potentials of protein folding [17]. If the model is found 
to have errors, the cycle can be repeated, with changes in 
template structures or alterations to the sequence-structure 
alignment. 


1.2.1 Programs and Servers for Comparative Modeling 


Many comparative modeling programs are available to the 
researcher. One of the most popular stand-alone programs is 
MODELLER [16]. This was the original method of protein modeling 
by satisfaction of spatial restraints and is still a rapid and 
reasonably accurate method of obtaining structural models. It 
does not have a GUI but can be used as part of the PyMod plugin 
[18] for PyMOL visualization software [19]. This allows for a 
complete sequence searching and structure modeling package for 
academic use. 

Automatic structure prediction servers, such as HHpred [20], 
I-TASSER [21], Robetta [22], and RaptorX [23], can also produce 
high-quality models. Such servers are evaluated on a weekly basis 
by the Continuous Automated Model EvaluatiOn (CAMEO) server [24]. 
Most of these server-based programs can also be downloaded for 
individual or group usage. 


1.2.2 The Rise of AlphaFold 


The recent success of the AlphaFold2 program at the 14th Critical 
Assessment of Techniques for Protein Structure Prediction 
(CASP14) made headlines worldwide [25]. AlphaFold directly 
predicts protein structure using only the target sequence and a 
multiple sequence alignment as inputs. The inputs first go through 
a neural network block known as Evoformer, which processes the 
alignment and target sequence into arrays. These then enter the 
structure module, where an explicit 3D representation of each 
residue is optimized through rotation and translation to obtain a 
highly accurate model. The assessors of CASP14 considered the 
accuracy of AlphaFold2 for nearly two-thirds of its predictions to be 
competitive with that of experimental methods (~1 Å deviation for 
the protein backbone). Other related programs such as trRosetta 
[26] and RoseTTAFold [27] also offer highly accurate structure 
models. 


Having obtained the protein structure model, the researcher may 
now require the use of docking software to dock small molecules to 
the protein. Many proteins interact with small molecules and are 
involved in processes including catalysis and signal transduction. If 
the researcher is modeling an enzyme of interest, small molecule 
substrates, products, and inhibitors may be of value to include 
within the model for research. The goal is to predict the preferred 
orientation of the small molecule (or “ligand”) relative to the 
protein (or “receptor”) in the formation of a stable protein-ligand 
complex. This orientation can then be used to predict binding 
affinity of the ligand for the receptor protein through the use of a 
scoring function. These techniques are most commonly used in 
structure-based drug design and engineering of proteins towards 
desired substrates or products. 

The basic protocol involves (a) a search algorithm coupled with 
(b) a scoring function. The docking problem can be approached as 
matching the shape complementarity [28] or the pairwise interac- 
tion energies [29] of the ligand and receptor. The searching algo- 
rithm involves systematic searching of the optimal ligand binding 
pose. This can be done by exploring all rotatable bonds of the 
ligand [30] and molecular dynamics simulations [31] or by genetic 
algorithms [32]. The various poses are evaluated by a scoring 
function — usually a physics-based molecular mechanics function 
that calculates the energy of binding for the ligand [33]. 
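
The pairing of a search algorithm with a scoring function can be illustrated with a toy example. The sketch below is not how any of the cited docking programs work internally: it treats the ligand as rigid, samples random translations only (no rotations or torsions), and uses a generic 12-6 Lennard-Jones-style term as the score. All coordinates, parameters, and function names are assumptions made for illustration.

```python
import math
import random


def pair_energy(r, r_min=3.5, depth=1.0):
    """12-6 Lennard-Jones-style term with its minimum (-depth) at r_min (Å)."""
    ratio = r_min / max(r, 1e-6)  # guard against division by zero
    return depth * (ratio ** 12 - 2 * ratio ** 6)


def score_pose(ligand, receptor):
    """Scoring function: sum of pairwise terms over all ligand-receptor atom pairs."""
    return sum(pair_energy(math.dist(a, b)) for a in ligand for b in receptor)


def translate(ligand, shift):
    """Rigidly shift every ligand atom by the same vector."""
    return [tuple(c + s for c, s in zip(atom, shift)) for atom in ligand]


def random_search(ligand, receptor, box=6.0, steps=2000, seed=0):
    """Search algorithm: try random rigid translations, keep the lowest score."""
    rng = random.Random(seed)
    best_pose, best_score = ligand, score_pose(ligand, receptor)
    for _ in range(steps):
        shift = tuple(rng.uniform(-box, box) for _ in range(3))
        pose = translate(ligand, shift)
        pose_score = score_pose(pose, receptor)
        if pose_score < best_score:
            best_pose, best_score = pose, pose_score
    return best_pose, best_score


# Toy atoms, invented for illustration only.
receptor = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
ligand = [(8.0, 8.0, 8.0), (9.5, 8.0, 8.0)]

initial_score = score_pose(ligand, receptor)
best_pose, best_score = random_search(ligand, receptor)
print(best_score <= initial_score)  # True
```

Real docking programs replace both halves with far more sophisticated components, but the division of labor is the same: the search proposes poses, and the scoring function ranks them.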


Due to its importance in structure-based drug design, many com- 
mercial docking programs are available including Glide [34], 
GOLD [32], and MOE [35]. Some small molecule docking servers 
are also available, such as SwissDock [36], PatchDock [37], and 
EADock [38]. Software that is available for download and free for 
academic use includes DOCK [39], AutoDock [40], and AutoDock Vina 
[41, 42]. 


Most proteins operate at physiological conditions, and for most 
research purposes, this is not an issue. Given a structure, or accurate 
protein structure model, it is possible to identify catalytic residues 
and binding pockets and also obtain docked substrates/inhibitors 
as already discussed. However, when it comes to the use of proteins 
as biocatalysts for industrial purposes (a field becoming more and 
more active due to the potential for environmentally friendly 
biosynthesis of fine or bulk chemicals), this becomes a problem. 
Industrial reaction conditions, including higher temperatures, as 
well as substrate and product concentrations required for large- 
scale chemical biosynthesis, often lead to loss of stability, causing 
poor enzyme efficiency and loss of activity. 

Certain strategies can be effectively engaged to improve protein 
physical and reactive properties to produce active biocatalysts 
beyond physiological conditions. Directed protein evolution, com- 
bined with rational and semi-rational approaches to mutational 
library design, is a very effective means of improving protein activity 
under industrial conditions. This process has four basic steps: 


A. Start with an appropriate sequence — referred to as the “back- 
bone.” This is a sequence that possesses some activity (however 
small) on the substrate at physiological conditions, or pre- 
dicted to be active given certain mutations within the active 
site (if the substrate is non-native). 


B. Library construction on the backbone. Directed evolution 
involves the use of “libraries” of mutations. The most basic 
form of library construction involves error-prone PCR 
[43]. Other, more targeted libraries for specific amino acid 
residues involve site-saturation mutagenesis [44] for single 
positions and potentially the whole enzyme, as well as predicted 
single mutations. Combinatorial libraries can be used 
for optimizing multiple positions together. 

C. The mutants are then screened at the required conditions, or 
conditions close to those needed. Most mutations will be 
deleterious or neutral, but a number will be beneficial [45]. 


D. The beneficial mutations are recombined onto the most active 
hit from the first round of evolution, often through a combi- 
natorial library. The cycle then begins again until the desired 
conditions are met by the evolved biocatalyst. 
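
The library types mentioned in step B differ greatly in size, which affects how much screening is needed in step C. The following sketch enumerates a site-saturation library at one position and computes the size of combinatorial libraries. The backbone sequence and function names are hypothetical and serve only as an illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def site_saturation_variants(sequence, position):
    """All 19 single mutants at a 1-based position, named like 'F14A'."""
    wild_type = sequence[position - 1]
    return [f"{wild_type}{position}{aa}" for aa in AMINO_ACIDS if aa != wild_type]


def combinatorial_library_size(options_per_position):
    """Number of distinct variants when each position is varied independently."""
    size = 1
    for n_options in options_per_position:
        size *= n_options
    return size


backbone = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical backbone sequence

print(len(site_saturation_variants(backbone, 14)))  # 19
print(combinatorial_library_size([20, 20, 20]))     # 8000
print(combinatorial_library_size([2, 2, 2, 2, 2]))  # 32
```

Full saturation of three positions at once already gives 8000 protein variants, whereas recombining five beneficial mutations as present/absent choices gives only 32, which is why beneficial hits are usually identified first and recombined afterwards.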


This cycle can be used to improve any functional property of 
the protein of interest. For most basic research, a few rounds of 
evolution are all that is necessary for proof of concept. This is 
illustrated well in a study for production of simvastatin by the 
LovD enzyme [46], with a parallel, longer optimization of the 
same enzyme for industrial production [47]. 


The overall pipeline is illustrated in Fig. 1. All methods described in 
this work can be run on a standard Linux-based laptop, desktop, or 
workstation. Many of them are also compatible with Mac OS X and 
Microsoft Windows, allowing for ease of use for any experimental 
researcher. For visualization of protein structures and models, 
PyMOL (Schrödinger) is recommended [19]. Newer versions are 
available to download with a subscription to an academic license, 
but older, freely available unsupported versions may also be found 
online. 


Fig. 1 A pipeline for interrogation of protein sequences. The sequence of interest is obtained by genome 
mining (a), followed by three-dimensional protein structure prediction using AlphaFold (b). This model can 
then be used to dock substrates/ligands of interest with AutoDock Vina (c), laying the groundwork for directed 
evolution to improve protein activity, stability, or other functional parameters (d) 


2.1 Searching Genomes and Proteomes with BLAST and HMMER 


BLAST software is obtained from the NCBI website 
(https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download), 
and HMMER software can be found at the HMMER homepage 
(http://hmmer.org). BLAST databases are also available for 
download, but here we will describe how to create a searchable 
sequence database for BLAST using a genome/proteome FASTA file. 
The BLAST software includes a script called “makeblastdb,” which 
performs the necessary conversion. Having installed the BLAST 
software, and given a proteome sequence file “proteome.fa,” the 
following Unix command may be used to create the database 
(see Note 1): 


$ makeblastdb -in proteome.fa -parse_seqids -blastdb_version 5 -title "Proteome Database" -dbtype prot 


If a nucleotide sequence file is used, then the dbtype should be 
set to “nucl.” To search the newly created database with protein- 
protein blast (blastp) using a query sequence “query.fa,” the fol- 
lowing command is used: 


$ blastp -db proteome.fa -query query.fa -out query_results.out 


This can be tailored for any of the BLAST programs. 
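
For downstream filtering it is often convenient to request tabular output by adding the BLAST+ option `-outfmt 6` to the search command; this is a suggestion and not part of the protocol above. The following Python sketch parses such a file, keeps hits below an E-value threshold, and sorts them by bit score. The threshold, the example hits, and the function name are assumptions made for illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_blast_tabular(handle, max_evalue=1e-5):
    """Return hits with evalue <= max_evalue, best bit score first."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    return sorted(hits, key=lambda h: h["bitscore"], reverse=True)


# Two invented hits standing in for the contents of a results file.
example = (
    "query1\tprotA\t85.2\t310\t46\t0\t1\t310\t5\t314\t1e-150\t520\n"
    "query1\tprotB\t31.0\t120\t80\t3\t10\t125\t40\t160\t0.002\t38.5\n"
)
hits = parse_blast_tabular(io.StringIO(example))
print([h["sseqid"] for h in hits])  # ['protA']
```

With a real results file, the `io.StringIO` object would simply be replaced by an open file handle.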

HMMER software only requires a multi-sequence FASTA file, 
without the need for conversion to a database. Profile HMMs are 
found in the “Pfam-A.hmm” file, and individual HMM profiles of 
interest can be retrieved using the “hmmfetch” command 
(see Note 2). The profiles can be used to search the proteome 
sequence file as follows: 


$ hmmsearch profile-hmm proteome.fa > profile_results.out 


2.1.1 Searching for Biosynthetic Gene Clusters with antiSMASH 


antiSMASH is available for download and is also offered as a server 
(https://antismash.secondarymetabolites.org/). Links to the fungal 
and plant versions are on the website. The inputs are a genome 
file in FASTA or GenBank format and an annotation file in GFF3 
format (see Note 3). The web-based output allows for full 
interrogation of potential BGCs, as well as downloading the 
identified sequences for further study. 


2.2 Modeling Protein Structures with AlphaFold Through Google Colab 


AlphaFold2 software is available for download but requires the use 
of a GPU cluster. However, an alternative exists for the researcher 
with restricted computational power: “ColabFold” [48], which runs 
on Google Colaboratory, is offered as a service to the scientific 
community. The researcher can access the Colab notebook for 
AlphaFold and generate accurate protein models within an hour or 
two, in an interactive setting, using Google’s GPU clusters 
(see Note 4). The basic protocol is as follows: 


A. Access the Colab notebook 
(https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb). 


B. Paste the target sequence into the notebook and add a name 
for your project. AlphaFold is also optimized for multimers 
and complexes. If you wish to model a dimer, paste 
SEQUENCE1:SEQUENCE1, using a colon as a break. For an 
α2β2 tetramer, you would paste 
SEQUENCE1:SEQUENCE1:SEQUENCE2:SEQUENCE2. 


C. Check “use_amber” and “use_templates.” The use of templates 
does not affect the overall model structure, but as they 
can be used as extra restraints in the prediction, it is worth 
adding them. Sidechains must be optimized through AMBER 
[49]: although this relaxation can double the runtime for 
structure prediction, accurate sidechain prediction is essential 
for any further use of the model for docking studies or protein 
engineering. 

D. Go to the Runtime tab and click “run all.” The server runs 
interactively, so the browser window must remain open at all 
times (a subscription is required to keep data if the window is 
accidentally closed, or the computer goes into sleep mode). 


E. After completion, a .zip file is created containing all of the 
results and automatically downloaded to the computer. ColabFold 
outputs five amber-relaxed and unrelaxed models, the 
input multiple sequence alignment generated from sequence 
databases, as well as several plots including sequence coverage 
and predicted local distance difference test (pLDDT) scores for 
all residues. 


Figure 2 shows some typical results for a ColabFold run. The 
sequence used here is the predicted protein sequence of 
Cav01g29270, a chalcone synthase gene obtained from the 
recently sequenced genome of the European hazelnut (Corylus 
avellana L.) [50]. The chalcone synthase family is a large sequence 
family, and a good quality model was obtained due to the high 
sequence coverage of the multiple sequence alignment obtained 
against most of the length of the predicted sequence. 


Fig. 2 Results from a ColabFold structure prediction for the protein product of the chalcone synthase gene 
Cav01g29270 from hazelnut. Outputs include a plot of sequence coverage of the multiple sequence alignment 
(MSA) to the target sequence (a), a plot of predicted local distance difference test (pLDDT) scores (b) for the 
five predicted models (colored by chain), and a plot of predicted aligned error (PAE) between the chains for the 
highest ranked dimeric model (c). The pLDDT score is very high across the model, and the PAE score between 
the chains is very low, indicating a confident prediction of the protein structure. This is a function of the very 
high sequence coverage of the MSA across this sequence. The worst scoring region is the first 20 residues at 
the N-terminus of the sequence, which have little to no sequence coverage in the MSA. Panel (d) shows the highest 
ranking structure obtained colored by pLDDT (spectrum of red (50 and below) to blue (90 and above)), showing 
both the overall high score for this model and the low-scoring N-terminal residues 


2.3 Small Molecule Docking with AutoDock Vina 


AutoDock Vina is a new generation of AutoDock. First released in 
2010 by the Olson group at the Scripps Research Institute [41], 
Vina enabled fast, accurate small molecule docking with limited 
computational power. It also included the treatment of the receptor 
as flexible (identified flexible sidechains in the binding pocket could 
be included in the search space). It has recently undergone a 
revision (version 1.2) [42] which allows it to use the AutoDock 
scoring function (AutoDock4.2). It also has expanded capacities for 
specific water molecule inclusion and simultaneous docking of 
multiple ligands (ideal for cases where an enzyme binds a cofactor 
as well as a substrate, or for looking at binding of enzymatic 
cleavage products). A short but informative video tutorial is 
available online (see Note 5), but we shall consider the basics here: 


A. The associated program AutoDockTools can be downloaded 
and used to do three things. Firstly, the protein is prepared for 
docking by adding polar hydrogens (all non-polar hydrogens 
are removed and are implicitly treated as part of the non-polar 
heavy atom they are bonded to). Protonation states for active 
site histidines should be checked. Also, the relative rotameric 
conformation of histidine, asparagine, and glutamine 
(“HNQ”) should be checked to ensure optimal hydrogen- 
bonding networks using an online server such as WHAT IF 
[51]. Secondly, a grid that encompasses the entire binding 
pocket is calculated, and dimensions/central origin can be 
noted down in a text file to define the search space. Thirdly, the 
ligand is parameterized and all rotatable bonds are identified for 
the search. The protein and ligand are stored as PDBQT files, 
which are based on the PDB format, but include atomic 
charges (and rotation information for the ligand). 


B. A configuration file (“docking.conf”) can be created where the 
protein and ligand PDBQT files are specified, as well as the 
name of the file to be saved containing the docked poses and 
the calculated grid dimensions. 


C. The program can then be run using the following command: 

$ vina --config docking.conf --log docking.log 

where vina stands for the full path to your installation of 
the program. The docked models can then be assessed either in 
AutoDockTools or in PyMOL. The lowest energy docking poses 
can be checked to see whether they are in appropriate positions 
to affect protein activity (e.g., whether the substrate is in an 
appropriate position for enzyme catalysis). 
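
The configuration file from step B is a plain text file of `key = value` lines. The sketch below generates one from Python, which can be convenient when docking many ligands into the same grid. The option names (`receptor`, `ligand`, `out`, `center_*`, `size_*`, `exhaustiveness`) are standard AutoDock Vina options; the file names, grid center, and box size are placeholders that must be replaced with the values noted down in AutoDockTools.

```python
def vina_config(receptor, ligand, out, center, size, exhaustiveness=8):
    """Return the text of an AutoDock Vina configuration file.

    center and size are (x, y, z) tuples in Å describing the search box.
    """
    cx, cy, cz = center
    sx, sy, sz = size
    lines = [
        f"receptor = {receptor}",
        f"ligand = {ligand}",
        f"out = {out}",
        f"center_x = {cx}",
        f"center_y = {cy}",
        f"center_z = {cz}",
        f"size_x = {sx}",
        f"size_y = {sy}",
        f"size_z = {sz}",
        f"exhaustiveness = {exhaustiveness}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder file names and grid values, for illustration only.
config_text = vina_config(
    receptor="protein.pdbqt",
    ligand="ligand.pdbqt",
    out="docked_poses.pdbqt",
    center=(12.5, -3.0, 40.2),
    size=(20, 20, 20),
)
print(config_text)
```

Writing `config_text` to “docking.conf” gives a file that can be passed to the command in step C.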


Having a three-dimensional protein model (with a docked 
substrate/inhibitor/ligand) is a good start for a protein engineering 
project. Here we will discuss a few simple strategies for 
improvement of physical and functional properties of proteins. 

The first thing is to make a list of the amino acid residues in 
different parts of the structure. Using the docked substrate, it is 
straightforward in PyMOL or other protein visualization tools to 
identify important clusters of residues. It is helpful to divide the 
whole protein structure into four (or five) basic bins: 


• The active site/binding pocket. Calculate all amino acid residues 
<4 Å away from the bound substrate. This should include the 
catalytic residues of your enzyme and the constellation of 
residues that make up the substrate-binding pocket. These 
positions are most likely to affect the activity of the enzyme and the 
specificity and stereoselectivity of the reaction products. 


• The cofactor-binding residues. This is an optional bin for those 
enzymes which require a small-molecule cofactor for activity. 
Again, calculate all positions within 4 Å of the cofactor 
molecule. These positions can often affect the overall stability of the 
protein, as the binding of the cofactor by the protein adds to the 
stability through a chelation effect (the stronger the binding to 
the cofactor, the more stable the protein). 


• The secondary sphere residues. These are all positions between 
4 and 8 Å of the substrate/ligand and may have an effect on 
specificity and activity due to their relative closeness to the active 
site and binding pocket. 


• Core residues. These are the rest of the residues found in the 
interior of the protein, mostly hydrophobic, and contribute 
more to overall stability. 


• Surface residues. All positions found on the surface of the 
enzyme. They may be fully exposed to solvent or partially buried 
in the core. 


For multimeric proteins/enzymes, another bin that may be of 
use is multimeric interfaces, which also contribute to overall 
stability (identifying all residues on that surface will be sufficient). This 
binning is not essential but is useful both in targeting for functional 
improvements and in interpreting observed improvements from a 
structural perspective, as one can easily identify the locations of 
positions that are potentially evolvable for functional purposes. 
We will now discuss strategies for improving three functional 
parameters: overall stability, activity/specificity, and thermostability. 


2.4.1 General Stability 


A relatively straightforward way of library design for improving 
protein stability is to identify “most common” mutations relative 
to your sequence. The basic idea is that through the evolution of a 
particular family of proteins, many amino acids (both specific and 
types) are conserved to maintain the stability of the protein, regard- 
less of its specific activity. A basic strategy here would be to obtain a 
list of related sequences, from 40% to 25% sequence identity (the 
identity zone at which the structures are likely to be conserved 
across all sequences). This can be done using BLAST against a 
non-redundant protein sequence database. The sequences can 
then be aligned to produce a multiple sequence alignment using, 
for example, Clustal Omega [52]. This alignment can be used to 
identify all amino acids at all given positions, and percentage amino 
acids at each position can be calculated. If your sequence has, for 
example, isoleucine at position 14, and the greatest percentage at 
position 14 in your multiple sequence alignment is for valine, then 
you can take I14V as a potential stability-enhancing mutation. This 
can be done across the sequence for as many positions as the 
researcher desires, and either individual mutations or, more 
effectively, a combinatorial library of mutations can be generated for 
your sequence and experimentally tested. The more stable the 
enzyme, the greater the potential for further evolvability [53]. 


2.4.2 Activity/Specificity 


Here, the most likely place for alteration of enzyme activity or 
specificity is the active site and the binding pocket. Given a docked 
substrate-enzyme complex model, the residue positions around the 
substrate can be identified as above, and particular mutations can be 
specified to enlarge/reduce the pocket size and add complementary 
charged/polar or hydrophobic residues. This can be performed 
with both native and non-native substrates. For a more thorough 
analysis, site-saturation mutagenesis is an excellent way of 
identifying specific beneficial mutations, which can then be recombined in 
following rounds of evolution. However, improvements in activity 
can occur through mutation far from the active site [54], and 
error-prone PCR is also a useful strategy to obtain random beneficial 
mutations around the protein, which can be combined with 
activity-improving active site mutations. 


2.4.3 Thermostability 


Thermostability is important for enzymes which may be required to 
perform reactions at higher temperatures. Many studies have iden- 
tified specific alterations to improve thermostability in proteins, and 
these can be used, along with structural details, to identify potential 
thermostable mutations [55]. The enzyme model and calculated 
residue bins mentioned above are very useful at this point. For core 
residues, increased branching of aliphatic sidechains (A > V; 
V > L/I) is often correlated with thermostability. For the surface, 
removal of flexible glycine and reactive sulfur-containing cysteine/
methionine and increase in surface prolines to increase rigidity may
also contribute. For polar surface residues, increasing the numbers 
of salt bridge pairs (D/E vs K/R) and removal of reactive aspartyl- 
prolyl motifs (DP) are recommended. The DP motifs are often 
found at the beginning of turns and helices, and mutation of 
aspartic acid to asparagine, serine, or threonine is suggested here. 
The model can be used to identify all these sites of potential 
improvement. 
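
As an illustration of how one of these sequence-level rules could be screened automatically, the short Python sketch below lists aspartyl-prolyl (DP) motifs in a sequence together with the aspartic acid to asparagine, serine, or threonine substitutions suggested above. The function name and the example sequence are hypothetical, and the structural model would still be needed to decide which of the reported sites are worth mutating.

```python
def find_dp_motifs(sequence):
    """Return (position, suggestions) for every aspartate followed by proline.

    Positions are 1-based, matching the usual residue numbering.
    """
    sequence = sequence.upper()
    hits = []
    for i in range(len(sequence) - 1):
        if sequence[i] == "D" and sequence[i + 1] == "P":
            pos = i + 1
            # Replacements for aspartic acid suggested in the text: N, S, or T.
            hits.append((pos, [f"D{pos}{aa}" for aa in "NST"]))
    return hits


# Hypothetical example sequence, for illustration only.
example = "MADPKDPL"
for pos, suggestions in find_dp_motifs(example):
    print(pos, suggestions)
```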

For thermostable mutations to be identified, both an analysis of 
the structure and the multiple sequence alignment can be used to 
find positions with naturally occurring diversity that can be used to 
produce combinatorial libraries for improving thermostability. One 
useful tool is the Sorting Intolerant from Tolerant (SIFT) server 
[56]. This allows the user to submit his or her sequence of interest, 
performs a sequence similarity search, and calculates which amino 
acids are tolerated/not tolerated at all positions. This data can be 
used to identify potential sequence diversity for library 
construction. 
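
The consensus idea used for stability (taking the most frequent residue at each alignment position as a candidate mutation) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the alignment is supplied as equal-length strings with "-" for gaps, the function name and the 50% cutoff are hypothetical choices, and the toy alignment is invented for the example.

```python
from collections import Counter


def consensus_mutations(target, homologs, min_fraction=0.5):
    """Suggest mutations where the alignment consensus differs from the target.

    target   -- aligned target sequence (one row of the MSA)
    homologs -- list of aligned homologous sequences (other rows of the MSA)
    Returns a list of (mutation, fraction) tuples such as ("I14V", 0.62).
    """
    suggestions = []
    position = 0  # residue numbering in the ungapped target sequence
    for column, residue in enumerate(target):
        if residue == "-":
            continue
        position += 1
        counts = Counter(seq[column] for seq in homologs if seq[column] != "-")
        if not counts:
            continue
        best, n_best = counts.most_common(1)[0]
        fraction = n_best / sum(counts.values())
        if best != residue and fraction >= min_fraction:
            suggestions.append((f"{residue}{position}{best}", fraction))
    return suggestions


# Toy alignment, invented for illustration.
print(consensus_mutations("MIK", ["MVK", "MVK", "MIK", "MVR"]))
```

Each suggested mutation remains a hypothesis to be tested experimentally, individually or as part of a combinatorial library.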


3 Conclusion


Having set out a pipeline for interrogation and experimentation of
a biological sequence from genome through to full protein
engineering, it is hoped that the experimental researcher will be
empowered to sit down at a workstation and augment his or her
experimental data with atomic-level detail for increased
understanding of their protein(s) of interest. Many of these procedures
have become faster and more accurate over time and possess great
explanatory power, granting detailed insights into protein
structure, function, and engineering possibilities.


4 Notes


1. Although the bare BLAST command names are used here, the full
path to the BLAST programs may be required, e.g., if
“makeblastdb” is located in /usr/local/bin/blast/, that path
should be added at the start of the command.


2. This is the simplest way of running HMMER for protein 
sequences. It can also be used for nucleotide sequences. To 
obtain the profile family names, run a query sequence through 
the PFAM website sequence search (http://pfam.xfam.org) 
and observe the protein family profiles identified in the search. 


3. The annotation file to be used with the antiSMASH software is 
to be checked to ensure it matches the genomic FASTA file, or 
the program quits within 15 min of running. 


4. The Google Colab version of AlphaFold can handle a maxi- 
mum of 1400 residues (monomer or multimers). A larger 
protein/protein complex would require local installation of 
the software. The accuracy of the model is dependent on 
both the number of related sequences in the multiple sequence 
alignment and the coverage of those sequences across the 
length of the target sequence. AlphaFold is very good for 
enzymes that match the traditional “lock-and-key” model of 
protein-ligand interactions. However, where the model is 
“induced fit,” or allostery is important, or the enzyme requires 
a large conformational change for activity, some
qualifications are in order. This author has tried to model the
epidermal growth factor receptor (EGFR) kinase domain and
obtained accurate models of the “closed” conformation only.
Likewise, AlphaFold does not model cofactors or metal ions,
but as it is trained on the PDB structure database, some models
may be “pre-organized” for inclusion of these cofactors, and
docking may be relatively easy. Also, AlphaFold cannot
accurately model the effect of single mutations on a structure (e.g.,
some mutations are known to cause large conformational
changes in kinases): because the alignment only samples native
sequences from databases and derives correlated mutations
from these, all mutations added to the target sequence will be
modeled in the same conformation as the “native” structure.


5. Some differences are to be expected in the AutoDock Vina 
online tutorial as it dates back to 2010 (e.g., the option “all” 
is now called “out” in the newer Vina version). Refer to the 
updated manual for a more detailed tutorial.


Creating De Novo Overlapped Genes


Dominic Y. Logel and Paul R. Jaschke 


Abstract 


Future applications of synthetic biology will rely on deploying engineered cells outside of lab environments 
for long periods of time. Currently, a significant roadblock to this application is the potential for deactivat- 
ing mutations in engineered genes. A recently developed method to protect engineered coding sequences 
from mutation is called Constraining Adaptive Mutations using Engineered Overlapping Sequences 
(CAMEOS). In this chapter we provide a workflow for utilizing CAMEOS to create synthetic overlaps 
between two genes, one essential (infA) and one non-essential (aroB), to protect the non-essential gene
from mutation and loss of protein function. In this workflow we detail the methods to collect large numbers 
of related protein sequences, produce multiple sequence alignments (MSAs), use the MSAs to generate 
hidden Markov models and Markov random field models, and finally generate a library of overlapping 
coding sequences through CAMEOS scripts. To assist practitioners with basic coding skills to try out the 
CAMEOS method, we have created a virtual machine containing all the required packages already installed 
that can be downloaded and run locally. 


Key words Deep learning, Machine learning, Generative model, Markov random field, Overlapping 
genes, Multiple sequence alignments, Protein design, Genome compression, Synthetic genomes, 
Synthetic biology 


1 Introduction


Synthetic biology has led to an explosion in designed genomic parts 
driving the production of novel functions and molecules [1]. This is 
done through the construction of genetic circuits with natural or 
engineered genes controlled by regulatory elements [2]. To make 
the design of engineered genomes easier, most genome design 
approaches seek to refactor genomes to remove genetic overlaps 
and cryptic regulation [3-8]; however, this does not necessarily 
provide evolutionary stability to designs [9]. In fact, engineered 
genes and synthetic architectures often place a deleterious growth 
burden on the expression host [10-12]. Thus, hosts which have 
lost the engineered gene have a growth advantage and over time 
become the dominant population. This phenomenon has led to
ways to constrain evolution by tying the expression of the engineered
part to the expression of an essential host component, thus
linking organism survival to the retention of the engineered com- 
ponent [13, 14]. Design stability is crucial as many future applica- 
tions for synthetic biology technologies are predicated on usage 
outside laboratories, such as engineered nitrogen fixation in cereal 
endophytes [15] and cleaning environmental pollutants [16, 17]. 

A novel way to add genetic stability to engineered genomes is 
called Constraining Adaptive Mutations using Engineered Over- 
lapping Sequences (CAMEOS) which seeks to emulate the 
condensed and overlapped coding sequence architecture found 
primarily in bacteriophage and bacteria [3, 4, 7, 18, 19]. This 
computational approach uses hidden Markov models (HMMs)
and Markov random fields (MRFs) to determine protein residue
diversity at a given position, and residue-residue contacts across the 
proteins, to generate overlapped coding sequences containing two 
proteins [20]. 

The foundation for the creation of protein generative models 
(HMMs and MRFs) is a multiple sequence alignment (MSA) 
[21]. For the accurate production of these models, the MSA must 
encompass thousands to tens of thousands of sequences (Fig. 1a). 
There are multiple algorithms available for performing protein 
alignments such as ClustalW [22], FAMSA [23], and MAFFT 
[24] all of which perform differently. 

Following the creation of an MSA of the two proteins to be 
co-encoded, HMMs and MRFs are generated (Fig. 1b). A HMM
operates by searching for patterns in a sequence space and calculates
the probability of a pattern, or state, occurring (e.g., G, C, A, and T
each having a 25% chance) and the transition probability of changing
states (e.g., a 75% chance of moving from state 1 to state 2). The
hidden component is the transitions between states inside an
observed sequence. The role of the HMM is to represent protein
sequence conservation across protein family members [25, 26]. A
MRF is an undirected graphical probability model and represents
combinations of independence assumptions which more directed
models, such as Bayesian modeling, cannot accurately depict
[25, 26]. The role of the MRF is to represent intra-protein
residue-residue coupling which may be crucial to protein function.
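
To make the distinction between the two kinds of information concrete, the Python sketch below computes, from a toy alignment, (1) per-position residue frequencies, which is the kind of conservation signal a profile HMM summarizes, and (2) the mutual information between two alignment columns, used here only as a crude stand-in for residue-residue coupling. This is not the algorithm used by hmmer or CCMpred; the function names and the toy alignment are assumptions made for illustration.

```python
import math
from collections import Counter


def column(msa, i):
    """Return the residues found in alignment column i."""
    return [seq[i] for seq in msa]


def site_frequencies(msa, i):
    """Per-position residue frequencies (conservation signal)."""
    counts = Counter(column(msa, i))
    return {aa: n / len(msa) for aa, n in counts.items()}


def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j.

    A simple proxy for coupling between two positions; real MRF
    training fits all couplings jointly rather than pair by pair.
    """
    n = len(msa)
    f_i = Counter(column(msa, i))
    f_j = Counter(column(msa, j))
    f_ij = Counter(zip(column(msa, i), column(msa, j)))
    mi = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi


# Toy alignment: columns 0 and 1 always change together, column 2 is invariant.
toy_msa = ["AKE", "AKE", "GRE", "GRE"]
print(site_frequencies(toy_msa, 0))
print(mutual_information(toy_msa, 0, 1))
print(mutual_information(toy_msa, 0, 2))
```

In the toy example, the first two columns co-vary perfectly and therefore share information, whereas the invariant third column is fully conserved but carries no coupling signal.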

As the HMM detects conserved direct relationships and the 
MRF detects conserved indirect relationships, these models 
together create a fuller picture of protein sequence and
structure in targeted proteins [26]. HMMs are used in CAMEOS
to create co-encoding solutions that are subsequently used as seeds 
in a second step where long-range interactions between protein 
residues are assessed with the MRFs [20].


A. Building multiple sequence alignments

1. Select target coding sequences
Website: EcoCyc
Output File(s): proteinA.fasta and proteinB.fasta
                proteins.fasta and cds.fasta

2. Download protein families from Pfam or InterPro
Website: InterPro or Pfam
Steps: Search sequence for protein family and download all sequences
Output File(s): proteinA_family.fasta
                proteinB_family.fasta

3. Construct multiple sequence alignment (MSA)
Input File(s): proteinA_family.fasta
               proteinB_family.fasta
Program: MAFFT/FAMSA
Output File(s): proteinA_alignment.msa
                proteinB_alignment.msa

4. Remove outliers from MSA
Input File(s): proteinA_alignment.msa
               proteinB_alignment.msa
Program: OD-seq
Output File(s): proteinA_trimmed_alignment.msa
                proteinB_trimmed_alignment.msa
Note: maintain enough sequences for modelling, N/sqrt(L) > ~200, where
N is the number of sequences in the MSA and L is the length of the
protein of interest

5. Construct MSA (round 2)
Input File(s): proteinA_trimmed_alignment.msa
               proteinB_trimmed_alignment.msa
Program: MAFFT/FAMSA
Output File(s): proteinA.msa
                proteinB.msa

B. Generating protein structure models

6. Train Hidden Markov Model (HMM)
Input File(s): proteinA.msa
               proteinB.msa
Program: hmmer
Output File(s): proteinA.hmm, .h3f, .h3i, .h3m, and .h3p
                proteinB.hmm, .h3f, .h3i, .h3m, and .h3p

7. Train Markov Random Field (MRF)
Input File(s): proteinA.msa
               proteinB.msa
Program: CCMpred
Output File(s): proteinA.raw
                proteinB.raw

C. Designing synthetic overlapping gene sequences

8. Convert CCMpred output to Julia compatible file type
Input File(s): proteinA.raw and proteinB.raw
Script: convert_ccm_to_jld.jl
Output File(s): proteinA.jld and proteinB.jld

9. Summarize pseudolikelihoods and energies
Input File(s): proteinA.jld
               proteinB.jld
Script: energies_and_psls.jl
Output File(s): psls_proteinA.txt and energy_proteinA.txt
                psls_proteinB.txt and energy_proteinB.txt

10. Generate gene overlap variants
Input File(s): runfile.txt, proteinA.jld, proteinA.hmm,
               proteinB.jld, proteinB.hmm
Script: main.jl | outparser.jl
Output File(s): output/
summary_BC.csv, all_final_fitness_BC.txt, top_twelve_BC.fa, saved_pop_BC.jld, log_BC.txt, and others

Fig. 1 CAMEOS workflow. (a) The first step in the CAMEOS workflow is to create and curate MSA for the two 
target proteins. This process requires downloading protein family libraries from Pfam or InterPro, aligning 
these sequences through FAMSA or MAFFT, removing outliers via OD-seq, and repeating the alignment. (b) 
The second step is creating protein structure models (HMM and MRF) through hmmer and CCMpred. (c) The 
final step uses the CAMEOS scripts to generate the synthetic overlapping proteins and the library of the 
overlapping coding sequences


In this chapter, we describe the steps needed to use CAMEOS
to design de novo overlapped genes. We detail the processes to
assemble sequences and generate multiple sequence alignments,
create HMMs and MRFs, run scripts included in the CAMEOS
directory to preprocess the input data into the correct formats, and,
finally, run the CAMEOS algorithm itself (Fig. 1c).


2 Materials

2.1 Hardware


Intel Core i7-4770 3.40 GHz with 4 cores and 32 GB RAM.


2.2 Software


1. Ubuntu v20.10: https://ubuntu.com/download/desktop.


2. HH-suite (v3.3.0) [27], an open-source package for sensitive
protein sequence searching based on the pairwise alignment of
hidden Markov models (HMMs). GitHub:
https://github.com/soedinglab/hh-suite.


3. GCC v4.4+, a C compiler written for the GNU operating
system. Website: https://gcc.gnu.org/.


4. CMake v2.8+, an open-source cross-platform tool family to
build, test, and package software. Website: https://cmake.org/.


5. CCMpred [28], an open-source package for learning protein
residue-residue contacts for building Markov random fields
(MRFs). GitHub: https://github.com/soedinglab/CCMpred.


6. CAMEOS [20], an open-source package to generate de novo
overlapped sequences. GitHub:
https://github.com/BiosecSFA/CAMEOS.


7. Julia (v1.4.1), a dynamic language for technical computing.
With packages: BioAlignments, BioSymbols, Logging, StatsBase,
JLD, Distributions, ArgParse, and NPZ.


8. Python (v3.9.5), an open-source cross-platform programming
language. Website: https://www.python.org/.


9. HDF5 (v1.10.6), a data software library and file format to
manage, process, and store heterogeneous data. Website:
https://www.hdfgroup.org/solutions/hdf5/.


10. gzip (v1.10), a data compression program for the GNU operating
system. Website: https://www.gnu.org/software/gzip/.


11. hmmer v3+ [29], an open-source package for searching
biological sequence databases for homologous sequences.
GitHub: https://github.com/EddyRivasLab/hmmer.


12. zlib1g-dev (v1.2.11) and groovy, a compression deflation
method found in gzip and PKZIP. Website:
https://packages.ubuntu.com/bionic/zlib1g-dev.


13. OD-seq, an MSA analysis software package which detects
outlier sequences. Download:
http://www.bioinf.ucd.ie/download/od-seq.tar.gz.


14. FASTX-Toolkit (v0.0.14), a collection of command-line tools
for Short-Reads FASTA/FASTQ files preprocessing. GitHub:
https://github.com/agordon/fastx_toolkit.


15. MAFFT (v7.310), a multiple sequence alignment program for
Unix-like operating systems. Website:
https://mafft.cbrc.jp/alignment/software/ and
https://anaconda.org/bioconda/mafft.


16. FAMSA (v1.6.2), an algorithm for large-scale multiple
sequence alignments. Website:
https://github.com/refresh-bio/FAMSA and
https://anaconda.org/bioconda/famsa.


3 Methods


Here, we describe the overall workflow to go from a pair of proteins
we want to overlap to the output of DNA sequences that can be
synthesized and tested in a wet lab. Complementary information to
what is presented here can be found in the excellent manual.pdf
file within the original CAMEOS GitHub repository
(https://github.com/wanglabcumc/CAMEOS/tree/master/doc).
Throughout this “Methods” section, we use code that was
downloaded from a fork of the original CAMEOS GitHub repository on
1 Dec 2021 (https://github.com/BiosecSFA/CAMEOS) that
improved the original code in several ways. For details, see notes
(https://github.com/wanglabcumc/CAMEOS/pull/2). For a
more comprehensive description of the development and
theoretical underpinnings of the CAMEOS method, please see the original
publication by Blazejewski et al. [20].


3.1 Choose Protein Sequences to Overlap


The choice of which two proteins to overlap is nearly limitless, but 
there will be constraints based on sequence similarity and compati- 
bility at the amino acid and DNA (coding sequence) level. The 
CAMEOS method was originally used to generate two sets of E. coli 
sequence pairs (CysJ-InfA and IlvA-CcdB), via >7500 designs. A 
subset of these designs that were experimentally characterized 
showed that protein function and activity were maintained in 
both co-encoded proteins across their designs [20]. Additionally, 
5.8 million theoretical overlaps between 199 essential genes and 
49 non-essential biosynthetic gene sequences were computed. 
These analyses showed that 9% of their computationally analyzed 
subset contained pseudo-likelihood scores exceeding the experi- 
mentally characterized sequence pairs. From this, it was inferred 
that 80% of the biosynthetic genes could be encoded with at least 
one essential gene. In this chapter we will use the infA (translation
initiation factor IF-1) and aroB (3-dehydroquinate synthase) E. coli
coding sequences (see Notes 1 and 2) originally included as
examples with the CAMEOS code on GitHub. All following examples
will be just for InfA, but it should be assumed that, where
appropriate, the same process must also be done for AroB
sequences. Additionally, because the AroB protein is longer, the
analyses of this protein will take longer and may require more
computational resources.


3.2 Download Target Protein and Coding Sequences


There are several sources of very large multiple sequence
alignments (MSAs) that can be used as a starting point for a CAMEOS
experiment. We will focus here on the Pfam [30] and InterPro [31]
databases, which are both large collections of protein families
created and hosted by EMBL-EBI.

We will use these databases as sources of large numbers of 
homologous sequences we can use to produce our own high- 
quality MSAs that are then fed into the CAMEOS workflow. 


1. To determine the minimum number of protein sequences we will
need for our alignments, we can approximate using this
formula: N/sqrt(L) > ~200, where N is the number of sequences in the
MSA, sqrt() is the square root, and L is the length of the protein of
interest in amino acids (https://github.com/wanglabcumc/CAMEOS).
For InfA with a length of 72 aa, the number of sequences in the
MSA would need to be N > sqrt(72) x 200, i.e., more than
approximately 1697. For AroB, with a length of 362 amino
acids, the minimum number would be more than double at
approximately 3805 sequences. This number of sequences could be
fulfilled from either Pfam or InterPro, but we will detail how to
download sequences from InterPro as it provides a higher number.


2. First navigate to: https://www.ebi.ac.uk/interpro/search/text/. 


3. Keyword search for “IF-1” since this is the protein encoded by 
infA gene (see Note 3). 
4. Potentially more accurate searching can also be done using the
amino acid sequence of the protein of interest. In this case
you would use the “By Sequence” option under the “Search” menu
of InterPro and enter the FASTA sequence of the protein whose
family you want to identify (Fig. 2a).


5. Click on “ACCESSION” link (IPR004368) for “Translation 
initiation factor IF-1” from InterPro under the “SOURCE 
DATABASE” heading. 


6. Click on the Proteins (46 K) header within this tab. Further
filtering of the family can be performed to separate “reviewed”
and “unreviewed” sequences. However, as the 634 reviewed
proteins fall below the >1697 sequences required, we will
continue with the unfiltered data.


7. Click on triangle portion of blue Export button on right-hand 
side of page.


Fig. 2 InterPro website navigation. (a) Search using the protein sequence of choice in the InterPro search bar 
to identify the protein family of the target protein. (b) After selecting the search results and navigating to the 
protein family, move to the Protein tab on the webpage, find the Export function, and click on the See More 
Download Options when hovering over the FASTA Generate button. (c) On this page, select the chosen data 
outputs and click Download 


8. Hover the cursor over the Generate button beside the FASTA
entry and you will see a popup window with “See more download
options” (Fig. 2b). Click the button.


9. On the new page that opens, the “Choose a main data type” 
header should be “Protein.” 


10. Scroll down, and under “Select Output Format” heading, 
change to “FASTA.” 


11. Scroll down to bottom of page and click the Generate button 
(see Notes 4 and 5). 


12. When the data is ready for download, the “Download” button 
will light up (Fig. 2c). Click this button and name the file 
infA.fasta. 
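
The rule of thumb from step 1, N/sqrt(L) > ~200, can be checked for any protein length with a short Python sketch such as the one below. The function name is hypothetical, and the value of 200 is the approximate threshold quoted above rather than a hard limit.

```python
import math


def min_msa_sequences(protein_length, ratio=200):
    """Smallest whole number of sequences N with N / sqrt(L) > ratio."""
    threshold = ratio * math.sqrt(protein_length)
    return math.floor(threshold) + 1


for name, length in [("InfA", 72), ("AroB", 362)]:
    threshold = 200 * math.sqrt(length)
    print(f"{name}: L = {length} aa, threshold = {threshold:.0f}, "
          f"minimum N = {min_msa_sequences(length)}")
```

Because the threshold itself is approximate, the values serve as a guide to whether a downloaded family (for example, the roughly 46,000 InterPro sequences for IF-1) is comfortably large enough.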
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3.3 Gathering 
Additional Sequences
with HHblits


While InterPro and Pfam are excellent resources for downloading
the sequences of protein families, sometimes more sequences are
required to train the protein models than are provided by these
sources. A way of gathering additional sequences is to use the
HHblits tool within HH-suite, which iteratively searches sequences
to detect similarities, building high-quality MSAs [27]. There are
two methods for using HHblits for gathering aligned sequences:
either running the hhblits command on a CLI or using the
HHblits webtool.

The HHblits webtool
(https://toolkit.tuebingen.mpg.de/tools/hhblits) will be discussed
first as it is the simpler approach, although it offers fewer user
input options. The tool requires a single protein sequence of
interest or a MSA as the starting point. Additionally, the user is
able to specify the following search parameters: (1) the Expect (E)
value cutoff for inclusion, (2) the number of HHblits search
iterations performed, (3) the minimum probability in the hitlist,
and (4) the maximum number of target hits. All modification options
are available within a dropdown menu, and the default settings are
clearly noted. Within a few minutes of submission, HHblits will
return results listing (1) the number of hits, (2) their identity,
and (3) the alignment of those sequences (Fig. 3a). For the
downstream processes, the user must navigate to the “Query Template
MSA” tab and download the full MSA file by clicking the option
“Download Full MSA.” The file generated from this is in a
protein.a3m file format and can be easily used with MAFFT without
conversion to a protein.msa file extension (Fig. 3b).

The other option to gather more sequences using the HHblits
algorithm is to download the HH-suite package from GitHub
directly (see Note 6) and run hhblits from a CLI. The associated
manual on hhsuite is very detailed and easy to use; however, the
tool requires a large downloaded sequence database (50+ GB) to
function.


3.4 Gathering Additional Sequences with PSI-BLAST


A complementary approach to the protein domain-focused
databases InterPro and Pfam, and the search tool HHblits, is the
algorithm Position-Specific Iterated BLAST, or PSI-BLAST.
PSI-BLAST is a publicly available database search tool hosted by NCBI
which performs an iterative search function against a protein query.
Detailed instructions for PSI-BLAST are on the NCBI website and
in this reference [32]. PSI-BLAST has some advantages over the
previously mentioned tools as it provides increased sequence
coverage at the cost of lower sequence identity. This is important as
building the HMM will require a low number of gaps to generate a
HMM profile that is usable with CAMEOS.

Fig. 3 HHblits webserver outputs. (a) Once HHblits has queried the user input sequence, the server generates a
visualization output aligning returned sequences to the input sequence within the Results Table. (b) MSAs (in the
.a3m format) for the sequence search are accessed via the Query Template MSA tab and can be downloaded
in Reduced or Full formats


PSI-BLAST is an easy-to-use tool requiring a single protein
sequence input. To use the PSI-BLAST function, users navigate to
the protein BLAST (blastp) on NCBI, enter the input protein in
the Query Sequence section, and select the PSI-BLAST algorithm
in the Program Selection section. From the original input,
PSI-BLAST will generate search results matching the input protein
limited either by sequence counts (default is 500) or by E value
cutoff (default is 0.005). These settings can be modified in the
algorithm parameters on the initial search page to expand or
constrict the available results from the first iteration. The initial
results are used to seed the second iteration, which is controlled by
selecting the number of sequences to add in the “Run PSI-BLAST
iteration 2” input.

Search results can be filtered afterwards by percentage identity, E
value, query coverage, and threshold cutoffs. Further iterations
can be performed to expand the sequence counts. All results can
be downloaded from the browser as either aligned or unaligned
sequences in FASTA format.


3.5 Perform Multiple Sequence Alignment Using MAFFT


The original CAMEOS publication used the FAMSA aligner [23] 
in concert with OD-seq [33] to remove outliers >2 standard devia- 
tions away from the sequences in a dataset, followed by manual 
removal of alignment positions when less than 50% of entries were 
aligned amino acids. Alternatively, we present another method 
below using the MAFFT aligner that seems to produce comparable 
results with less manual intervention. 
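
For readers who prefer to script the manual pruning step described above, the following Python sketch removes alignment columns in which fewer than 50% of the entries are aligned amino acids. It is a minimal illustration only: the function name is hypothetical, the alignment is assumed to be a list of equal-length strings with "-" as the gap character, and it is not the procedure used in the original CAMEOS publication.

```python
def drop_gappy_columns(msa, min_occupancy=0.5, gap="-"):
    """Remove alignment columns where too few sequences have a residue.

    msa           -- list of aligned sequences of equal length
    min_occupancy -- minimum fraction of non-gap entries needed to keep a column
    """
    if not msa:
        return []
    n_sequences = len(msa)
    keep = [
        col
        for col in range(len(msa[0]))
        if sum(seq[col] != gap for seq in msa) / n_sequences >= min_occupancy
    ]
    return ["".join(seq[col] for col in keep) for seq in msa]


# Toy alignment: the middle column is a residue in only one of three sequences.
print(drop_gappy_columns(["A-C", "A-C", "AG-"]))
```

Note that removing columns changes the alignment length, so the result should still be checked against the length of the target protein before model building.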


1. Because the MAFFT aligner is not as efficient as FAMSA with 
large numbers of sequences (>10,000), it may be necessary to 
take a subsample of the files obtained from InterPro. The 
fasta-subsample tool in the MEME Suite [34] is an easy 
way to do this. Because of interference between different tools 
used in this protocol, we used the Conda environment and 
package manager [35] to create a new environment just to 
run the MEME suite. All tools we use in this protocol were 
within the Ubuntu Linux operating system. After starting up 
a CLI: 


(base) $ conda activate meme 


Then from within that Conda environment where MEME 
suite has been installed, you can use the fasta-subsample 
tool. Navigate to the folder containing the infA.fasta file 
then use the CLI: 


(meme) $ fasta-subsample infA.fasta 10000 >infA_sub_10000.fasta


where

infA.fasta              The FASTA file containing more sequences than you
                        require

10000                   The number of sequences you wish to subsample

>infA_sub_10000.fasta   A redirect that stores the results of the subsampling
                        in a file called infA_sub_10000.fasta


You can check that there are actually 10,000 FASTA sequences within
this newly created file by using the grep command and piping the
results to the wc command:


(meme) $ grep -o '>' infA_sub_10000.fasta | wc -l
10000


2. Run the MAFFT tool within our (base) environment on the 
newly created subsampled FASTA file of InfA sequences. 


(base) $ time mafft --add infA_sub_10000.fasta --keeplength infA.fasta >infA.msa


We use the time command to show us how long the
alignment process took after it has completed. The --add
flag is used to add unaligned full-length sequence(s) into
an existing alignment. In this case we are not adding to a
pre-existing alignment; instead, the subsampled InfA sequences in
the file infA_sub_10000.fasta are aligned against the target
sequence. This is done so that the alignment
does not contain too many gaps and is relative to the target
sequence. The --keeplength flag is used to chop off the ends
of alignments that go over the E. coli InfA target sequence
(infA.fasta), which simulates the effects of manual pruning
(see Note 7).


3. The sequence alignment may contain outlier sequences that 
would reduce the accuracy of the CAMEOS designs. To 
remove outlier sequences from the MSA, we will use OD-seq 
[33] on the alignment (see Note 8). 


$ OD-seq -s 2 -i infA.msa -c infA_trim.msa


where

OD-seq          The program that removes outlier sequences.

-s              Flag to specify the number of standard deviations from the
                mean needed to be removed from the alignment (in this case
                two standard deviations are selected).

-i              Flag to specify input file name (infA.msa).

-c              Flag to specify the output file name for sequences with
                average distance of less than two standard deviations to the
                rest of the sequences in the alignment (infA_trim.msa).

-o (optional)   Flag to specify the output file name for sequences with
                average distance of more than two standard deviations to the
                rest of the sequences in the alignment.


4. The MAFFT alignment is then repeated to align the sequences
that were not removed by OD-seq.


5. The FASTA formatted output of MAFFT (and FAMSA) is not
directly compatible with the CAMEOS scripts so it must be
converted to a single-line FASTA format. For CAMEOS to
recognize the MSA files, each sequence, including gap
characters (-), must occupy only one line (see Note 9). We will use
the fasta_formatter command of FASTX-Toolkit (see
Note 10) to do this:


$ fasta_formatter -i infA_trim.msa -o infA.msa -w 0


where

fasta_formatter   Tool used to reformat FASTA sequences.

-i                Flag to specify input file name (infA_trim.msa).

-o                Flag to specify output file name (infA.msa).

-w                Flag to specify the max. sequence line width for output
                  FASTA file. The 0 means that sequence lines will not be
                  wrapped and all amino acids will appear on the same line.


3.6 Creation of a Protein Generative Model


Before two protein sequences can be artificially overlapped with the 
CAMEOS algorithm, each protein sequence must be analyzed to 
create both a HMM and a MRF representation. This is done so that
the CAMEOS algorithm can determine regions of the proteins 
where sequence flexibility and long-range interactions (residue- 
residue contacts) are amenable to coding sequence overlap in dif- 
ferent reading frames. Unless otherwise stated, we assume the 
proteinA.msa and proteinB.msa files (in our case infA.msa 
and aroB.msa) are within the main/ subfolder of the CAMEOS 
script folder and your CLI program’s present working directory 
(pwd) is also main/. 
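
Before building the models, it can be worth confirming that each MSA file meets the format requirements described in the previous section, i.e., one line per sequence and rows of equal length. The Python sketch below is one hypothetical way to do this; the function name is an assumption and it is not part of the CAMEOS scripts.

```python
def check_single_line_msa(lines):
    """Validate a single-line FASTA alignment.

    lines -- iterable of text lines (e.g., an open file handle)
    Returns (number_of_sequences, alignment_length) or raises ValueError.
    """
    sequences = []
    pending_header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if pending_header is not None:
                raise ValueError(f"no sequence found for {pending_header!r}")
            pending_header = line
        else:
            if pending_header is None:
                # A second sequence line for the same header means wrapping.
                raise ValueError("sequence is wrapped over several lines")
            sequences.append(line)
            pending_header = None
    if pending_header is not None:
        raise ValueError(f"no sequence found for {pending_header!r}")
    lengths = {len(seq) for seq in sequences}
    if len(lengths) > 1:
        raise ValueError(f"rows have different lengths: {sorted(lengths)}")
    return len(sequences), (lengths.pop() if lengths else 0)


# Example use on a file (commented out so the sketch runs without input files):
# with open("infA.msa") as handle:
#     print(check_single_line_msa(handle))
print(check_single_line_msa([">seq1", "MK-L", ">seq2", "MKAL"]))
```

For the InfA example, the reported alignment length would be expected to equal the 72 amino acids of the target protein.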


3.6.1 Training HMM Using Hmmer


1. Using our generated infA.msa and aroB.msa files, we will
first generate hidden Markov models (HMMs) of each protein
using the hmmer (http://hmmer.org/) command hmmbuild.


$ hmmbuild infA.hmm infA.msa


where

hmmbuild    The function of hmmer that builds a HMM
infA.hmm    Output file name
infA.msa    Input file name


2. Inspect the resulting information generated in the CLI to 
determine if the HMM was generated correctly. The CAMEOS 
script is able to use .hmm files that are incomplete without 
raising an error, but the end results of the process will be 
incorrect. An indication that your HMM files are not correct
is that the final engineered protein sequences will include
gaps and will be shorter than the actual protein sequences supplied
to the script in the proteins.fasta file.

To ensure these errors do not occur, your HMM must be 
of the same length as the input protein. In the hmmbuild 
output, check that “alen” and “mlen” are the same value 
(72 in this case) and that this value is the same as the length 
of the protein in amino acids (72 aa in this case is the full length 
of InfA). 


hmmbuild :: profile HMM construction from multiple sequence 
alignments 

HMMER 3.3 (Nov 2019); http://hmmer.org/ 

Copyright (C) 2019 Howard Hughes Medical Institute. 

Freely distributed under the BSD open source license. 

input alignment file: slyD.msa 

output HMM file: slyD.hmm 


idx name nseq alen mlen eff_nseq re/pos description 


1 infA 1586 72 72 0.52 0.592


3. The generated .hmm file is then compressed to create the final
.hmm/.h3f/.h3i/.h3m/.h3p files CAMEOS requires, using
the hmmpress command of the hmmer package:


$ hmmpress infA.hmm


where 
hmmpress The function of hmmer that prepares an HMM database 
infA.hmm Input file name 
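
The alen/mlen comparison described in step 2 can also be checked programmatically. The sketch below assumes that the hmmbuild summary has been captured as text (for example, by redirecting the CLI output to a file) and that data rows follow the column order shown above (idx, name, nseq, alen, mlen, ...). The function names are hypothetical and are not part of hmmer or CAMEOS.

```python
def hmmbuild_lengths(summary, model_name):
    """Return (alen, mlen) for model_name from a captured hmmbuild summary."""
    for line in summary.splitlines():
        fields = line.split()
        # data rows: idx name nseq alen mlen eff_nseq re/pos [description]
        if len(fields) >= 7 and fields[0].isdigit() and fields[1] == model_name:
            return int(fields[3]), int(fields[4])
    raise ValueError(f"no summary row found for model {model_name!r}")


def hmm_matches_protein(summary, model_name, protein_length):
    """True only if alen == mlen == the protein length in amino acids."""
    alen, mlen = hmmbuild_lengths(summary, model_name)
    return alen == mlen == protein_length
```

With the example output above, hmm_matches_protein(summary, "infA", 72) would be the check to run before moving on to hmmpress.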

3.6.2 Training Markov
Random Field Using
CCMpred

The MRF model for each protein will be trained using CCMpred
[28] to create residue-residue contact predictions. These models
are for later use in assessing the impact of protein sequence changes
and their long-range interactions within a protein family.


1. First we must convert the MSA files to a format that is compatible
with CCMpred (only sequences, no FASTA headers) using
an inverse-match grep command: 


$ grep -v ">" infA.msa > infA.ccm 


where 
grep A command-line utility for searching lines 
that match a regular expression 
-v Inverse-match flag
">" The string to match in the input file
infA.msa Input file 
> Save results of grep to a file 
infA.ccm Output file name 
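
As a cross-check on the grep step, the same filtering can be sketched in Python with one extra safeguard: every remaining row should have the same length, since the rows come from one alignment. This is an illustrative sketch only; the function name msa_to_ccm_rows is hypothetical, and the equal-length check is an added assumption about what a well-formed single-line MSA looks like, not a documented CCMpred requirement.

```python
def msa_to_ccm_rows(msa_text):
    """Drop every line containing '>' (like grep -v ">") and verify row lengths."""
    rows = [
        line.strip()
        for line in msa_text.splitlines()
        if line.strip() and ">" not in line
    ]
    lengths = {len(row) for row in rows}
    if len(lengths) > 1:
        # unequal rows usually mean the MSA was still line-wrapped
        raise ValueError(f"aligned rows differ in length: {sorted(lengths)}")
    return rows
```

Writing "\n".join(rows) to infA.ccm would then reproduce the output of the grep command for a single-line FASTA alignment.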


2. Next we invoke CCMpred to generate a .raw matrix file:


$ ccmpred -t 1 -r infA.raw -n 100 infA.ccm infA.mat


where 


ccmpred     The ccmpred command.
-t 1        (optional) Depending on the number of CPUs you have available
            for the computation, you may want to use a value >1 (default)
            here to complete the calculation faster.
-r          Store raw prediction matrix in RAWFILE format flag.
infA.raw    The output raw file.
-n 100      Compute a maximum of NUMITER operations [default: 50].
infA.ccm    The input file name.
infA.mat    The matrix output file.


3. The .raw file is not in the correct format expected by the
CAMEOS scripts, so it must be converted to an internal
MRF file format using a Julia [36] language script
(convert_ccm_to_jld.jl). The script generates a Julia Data
File (.jld) that is used when the main.jl CAMEOS Julia
script is run later:


$ julia convert_ccm_to_jld.jl infA.raw infA.jld


where
julia                   Runs script with Julia
convert_ccm_to_jld.jl   Name of script to run
infA.raw                Input file name
infA.jld                Output file name


The .jld files must then be transferred to the jlds/ subfolder
or the main.jl script will fail when run later.


3.6.3 Summarizing
Pseudo-Likelihoods/
Energies

1. Next we must summarize the data from CCMpred into formats
that work with the CAMEOS scripts. The pseudo-likelihoods
and energies of the proteins are used in the optimization process,
and so these values are calculated using the energies_and_psls.jl
script, which outputs two files,
psls_protein.txt and energy_protein.txt, into the
psls/ and energies/ subfolders, respectively. These folders
must already exist within the main/ folder, or the main.jl
script will fail when run. In our example, the files would be
named psls_infA.txt and energy_infA.txt.

The script is run: 


$ julia energies_and_psls.jl infA infA.jld infA.msa 


where
julia                  Runs script with Julia.
energies_and_psls.jl   Name of script to run.
infA                   The name of the protein.
infA.jld               Input Julia Data File name. Note, this was
                       generated in the previous step.
infA.msa               Input MSA file name. Note, this was
                       generated in a previous step.


3.6.4 Setting Up Folder
Structure

The main.jl script requires all the input files to be in a certain
folder structure or it will fail. Within the main/ folder, the correct
subfolder structure is:


energies/
Containing energy_infA.txt and energy_aroB.txt files.

hmms/
Containing infA.hmm, infA.hmm.h3f, infA.hmm.h3i,
infA.hmm.h3m, infA.hmm.h3p, aroB.hmm, aroB.hmm.h3f,
aroB.hmm.h3i, aroB.hmm.h3m, and aroB.hmm.h3p files.

jlds/
Containing infA.jld and aroB.jld.

NOTE: as of the time of this writing (early 2022), GitHub does
not host these files correctly, and instead of aroB.jld being
~464.9 MB and infA.jld being ~18.3 MB, they instead are
134 bytes and 133 bytes, respectively. The reduced-size files will
cause an error if used as is. There are two ways around this
limitation of GitHub. The first is to download the correct files here:
https://cloudstor.aarnet.edu.au/plus/s/jpMOfvlyOY2r4Wi.

Alternatively, if you run the CAMEOS process from the beginning,
as described in this chapter, you will generate your own
aroB.jld and infA.jld files of the correct size.

msas/
Containing infA.msa and aroB.msa.

output/
Containing nothing. The folder is empty at the start of the
script.

psls/
Containing psls_infA.txt and psls_aroB.txt.

Additionally, within the main/ folder, the following files are
required:

cds.fasta, proteins.fasta, runfile.txt


3.6.5 Running CAMEOS

1. The last step before running the main CAMEOS script is to 
modify the file containing the run parameters. Here, we call the 
file runfile.txt, but you can name it whatever makes sense 
to you. The file contains the parameters used during the main.jl
script execution and controls aspects of the CAMEOS
process, such as how many seeds to optimize and how many 
iterations to perform. These parameters must be adjusted care- 
fully because they can have dramatic effects, such as signifi- 
cantly extending runtime. The runfile.txt parameter file 
has the following structure: 


output/ infA aroB jlds/infA.jld jlds/aroB.jld hmms/infA.hmm hmms/aroB.hmm 100 p1 250


The file is tab-delimited (each unit of text is separated by a tab 
character) and stores the following information (see Note 11), 
where: 


output/          Directory where output files are stored.
infA             Gene/Protein A name.
aroB             Gene/Protein B name.
jlds/infA.jld    Path to jld file containing MRF (CCMpred data) for Gene A.
jlds/aroB.jld    Path to jld file containing MRF (CCMpred data) for Gene B.
hmms/infA.hmm    Path to HMM file (hmmer data) for Gene A.
hmms/aroB.hmm    Path to HMM file (hmmer data) for Gene B.
100              Number of seeds to optimize.
p1               Frame. This parameter should not be modified.
250              Number of iterations of algorithm.
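
Because the parameter file must be tab-delimited and main.jl depends on a fixed folder layout, both can be prepared and checked with a short script before a run. The following is a minimal sketch under stated assumptions: the helper names (runfile_line, expected_inputs, missing_inputs) are hypothetical, the file names follow the folder structure described in this chapter, and the frame token is simply passed through as the literal p1.

```python
from pathlib import Path

HMM_SUFFIXES = ("", ".h3f", ".h3i", ".h3m", ".h3p")


def runfile_line(out_dir, gene_a, gene_b, n_seeds, n_iters, frame="p1"):
    """Build the single tab-delimited line stored in the run parameter file."""
    fields = [
        out_dir, gene_a, gene_b,
        f"jlds/{gene_a}.jld", f"jlds/{gene_b}.jld",
        f"hmms/{gene_a}.hmm", f"hmms/{gene_b}.hmm",
        str(n_seeds), frame, str(n_iters),
    ]
    return "\t".join(fields)


def expected_inputs(gene_a, gene_b):
    """Relative paths that should exist inside main/ before a run."""
    paths = ["output", "cds.fasta", "proteins.fasta"]
    for gene in (gene_a, gene_b):
        paths.append(f"energies/energy_{gene}.txt")
        paths.extend(f"hmms/{gene}.hmm{suffix}" for suffix in HMM_SUFFIXES)
        paths.append(f"jlds/{gene}.jld")
        paths.append(f"msas/{gene}.msa")
        paths.append(f"psls/psls_{gene}.txt")
    return paths


def missing_inputs(main_dir, gene_a, gene_b):
    """List the expected paths that are absent under main_dir."""
    root = Path(main_dir)
    return [p for p in expected_inputs(gene_a, gene_b) if not (root / p).exists()]
```

A typical use would be to write runfile_line("output/", "infA", "aroB", 100, 250) to runfile.txt and to proceed only if missing_inputs(".", "infA", "aroB") returns an empty list. The check covers presence only. It does not detect the undersized .jld placeholder files mentioned earlier.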
The time the CAMEOS method takes to complete depends on
a number of factors such as sequence lengths, seed value, and
iteration value.

To run CAMEOS using the main.jl script, navigate to the
main/ folder and type:


$ julia main.jl runfile.txt


where
julia         Runs script with Julia
main.jl       The script to run
runfile.txt   A file containing the parameters used during the CAMEOS run


After a successful run, text will be displayed on the CLI,
similar to:


Running CAMEOS using parameters specified in runfile.txt
CAMEOS parameters are:
output/ infA aroB jlds/infA.jld jlds/aroB.jld hmms/infA.hmm hmms/aroB.hmm 100 p1 250
The random barcode on this run is: G8wCDktg
CAMEOS tensor built
Evaluating HMM seeds
Beginning long-range optimization.
Step 0 of 250...
Step 50 of 250...
Step 100 of 250...
Step 150 of 250...
Step 200 of 250...
1038.885614 seconds (262.57 M allocations: 310.487 GiB, 1.38% gc time)


3.6.6 Evaluating
CAMEOS Results


1. In the output/ subfolder, a number of files are created from a
successful CAMEOS run. The top_twelve_BC.fa (where 
BC is barcode of the run; in our example above, it would be 
top_twelve_G8wCDktg.fa) file contains the best three 
co-encodings of the two genes of interest from the best score 
of protein A (InfA in our example) and protein B (AroB in our 
example). Additionally, the file also contains co-encodings 
(CDS overlaps) with the best overall score. 

Although in most instances the script will just fail if the 
initial files are not of the correct type and location, we have seen 
a few cases where output is generated but is erroneous. For 
example, if the MSA files that are used have more characters in 
them than the sequences in the proteins.fasta file, the
output sequences that are generated will have gaps in them.
Therefore, careful analysis of the output sequences should be 
done before synthesizing the DNA to make the constructs. 
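
One part of this inspection, looking for gap characters in the designed sequences, is easy to automate. The sketch below assumes only that the results are available as ordinary FASTA text. The function name flag_gapped_records is hypothetical, and the exact layout of the CAMEOS output files is not assumed beyond headers starting with ">".

```python
def flag_gapped_records(fasta_text, gap_chars="-"):
    """Return the headers of FASTA records whose sequence contains a gap character."""
    flagged = []
    header = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
        elif header is not None and any(ch in line for ch in gap_chars):
            if header not in flagged:
                flagged.append(header)  # report each record once
    return flagged
```

Any header returned by this function points to a design that should be examined by hand before DNA synthesis.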

2. An additional script (from: https://github.com/BiosecSFA/ 
CAMEOS) can be used to summarize the information from the 
jld output file into FASTA and comma separated values (CSV) 
files, which are generally easier to look through. 

With the CLI’s present working directory as main/, exe- 
cute the following code: 


$ julia outparser.jl infA aroB BC --fasta


where 


julia            Runs script with Julia.
outparser.jl     Julia script that parses the CAMEOS output into easier to
                 analyze files.
infA             The name of the first protein co-encoded.
aroB             The name of the second protein co-encoded.
BC               Barcode from your CAMEOS run (e.g., G8wCDktg).
--fasta          Generates a FASTA file of the results in addition to a CSV.
--just-fullseq   (optional): this flag can be used in addition to the --fasta
                 flag to create a FASTA file that only contains the full sequence.


3. In the example of infA encoded within the aroB gene, we can
see the top scoring hits are located in either the 5’ or 3’ regions 
of aroB (Fig. 4a). Within the 5’ region, three InfA variant 
designs modified the AroB sequence on average 14% to enable 
the co-encoding of InfA into AroB. A similar result was 
observed in the 3’ region of aroB as the five InfA variants 
there modified AroB on average 13%. Despite being in two 
distinct regions, all designs incorporated a new residue at posi- 
tion 30. In most designs, this was lysine; however, in designs 
1 and 20, an arginine and tryptophan were incorporated, 
respectively (Fig. 4b). While no crystal structure is available 
for E. coli AroB, on UniProt an AlphaFold simulation is pub- 
licly available predicting the protein structure [37, 38]. The 
inserted AroB residue is incorporated 5’ adjacent to a proline 
residue which terminates a predicted alpha helix. All three 
modified residues have some favorability to form alpha helices; 
therefore, it is likely that the modification either continues the 
alpha helix one residue or has a secondary structural effect. 
Overall, AroB is predicted to be a highly structured protein;
therefore, all modified residues will interact with existing secondary
structures. For example, in the 5’ region containing
InfA designs, the existing structure is a combination of alpha
helices and beta sheets, while the 3’ region containing InfA
designs is populated with predominantly alpha helices with a
single beta sheet. Due to its short size, InfA had more significant
modifications to its sequence, as on average 45% of the
amino acid identities were altered (Fig. 4c). Unlike AroB, InfA
has an experimentally characterized structure, which is dominated
by a beta barrel with a single short alpha helix
[39]. Therefore, due to InfA’s small size and highly structured
topology, all residue changes would be members of an existing
beta sheet or alpha helix.


Fig. 4 Aligned CAMEOS AroB and InfA outputs. (a) The eight highest scoring CAMEOS designs incorporated
InfA within AroB within two regions, at the 5’ and 3’ ends. (b) AroB designs aligned to the wild-type AroB
sequence showing the locations where residues were modified. (c) InfA designs aligned to the wild-type InfA
sequence showing the locations where residues were modified


4. Differences in the co-encodings can also be seen when considering
the predicted translation efficiencies of infA from within
the aroB sequence. Using the RBS Calculator [40] on the top
five designs, we see a nearly eightfold difference between the predicted
translation efficiency of the worst and best infA designs.
Similarly, there is a ninefold difference for aroB (Fig. 5a). The
correlation between aroB and infA translation initiation rates
in this case is due to the N-terminal location of infA
co-encoding. If infA is co-encoded more C-terminally, the
two CDS translation initiation rates (TIRs) are not connected,
with aroB displaying a strong TIR (5577 AU) and infA displaying
a range of TIRs (1-286 AU), although 20-5577-fold
lower than aroB (Fig. 5b). For reference, in the infA natural
E. coli genomic context, it has a predicted TIR of 1211 AU,
which is at least fourfold higher than the best CAMEOS
encoding.


3.7 Putting It All
Together


1. As we have outlined, the successful completion of co-encoding
two proteins in the same DNA sequence in different reading 
frames is a complex and multistep process. One of the most 
burdensome barriers to entry for molecular biologists is access 
to a computer running Linux and the installation of all the 
tools needed. To ease this process and enable scientists without 
strong computational backgrounds to use the CAMEOS algo- 
rithm, we have created a virtual machine which comes 
pre-loaded with all the tools used in this protocol and can be 
easily run on a computer using Windows or macOS operating 
systems. The virtual machine file is ~20 GB in size, so ensure you
have plenty of room on your disk and a fast internet connection.
The disk size of the virtual machine is 50 GB, so as you add
data to your virtual machine, ensure you have at least 75 GB
free on your computer’s disk. Access the .ova file, which contains
the virtual machine (called “overlap”), here:
https://cloudstor.aarnet.edu.au/plus/s/800J]7SHTt463KE5.


The starting password for the Ubuntu operating system on the
virtual machine is overlap.
We strongly suggest you change the password once you start
using the virtual machine.

Fig. 5 Predicted translation efficiencies for CAMEOS designs. (a) Using RBS Calculator on the top five
N-terminal designs of infA/aroB overlap, we see a strong correlation between aroB translation initiation rate
(TIR) and the downstream infA TIR. This effect is likely due to the close proximity of the start codons and
ribosome binding sites of the co-encoded coding sequences. (b) Using RBS Calculator on other aroB/infA
co-encodings where the start codons and RBSs are spaced further apart shows no correlation between TIRs
but does show, as before, a wide range of infA TIRs

To check if the overlap.ova file is downloaded correctly on
macOS, start up a CLI such as Terminal in the directory you
downloaded the file to and run:


$ md5 overlap.ova


To check if the overlap.ova file is downloaded correctly on
Linux, start up a CLI such as Terminal in the directory you
downloaded the file to and run:


$ md5sum overlap.ova


To check if the overlap.ova file is downloaded correctly on
Windows, start up a CLI such as Command Prompt in the directory
you downloaded the file to and run:


C:\> certutil -hashfile overlap.ova MD5


The result of these commands should be:
b8f894df3305507bdee6e992ac87d75f


If your result does not match, then it is likely that the download
was interrupted and the overlap.ova file was corrupted. Please
try to download again. In the future, if the link to this resource
becomes broken, please check our lab website for details:
https://www.jaschke-lab.science/
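
The three platform-specific commands can be replaced by one Python check that behaves the same on macOS, Linux, and Windows. This is a minimal sketch using the standard hashlib module. The function names are hypothetical, and the expected checksum is passed in by the user rather than hard-coded.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # read 1 MiB at a time so a ~20 GB file never has to fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_matches(path, expected_md5):
    """Case-insensitive comparison against the published checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Calling file_matches("overlap.ova", "<checksum given above>") returns True only for an intact download.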

The virtual machine .ova file (overlap.ova) can be booted up
using the free Oracle VM VirtualBox software.


2. We have also created a script in the shell language Bash that can 
be used to accomplish all the previously described steps auto- 
matically, reducing the chances of human error from moving all 
these files around and using tools with certain parameters. This 
Bash script is called run_cameos.sh and is stored within the 
main/ folder of the CAMEOS code on the overlap.ova virtual 
machine. To just download the Bash script, please find it here: 
https://cloudstor.aarnet.edu.au/plus/s/QOIKhQAQCaNimdj


To use the script to perform a run of CAMEOS, you need to 
open the run_cameos.sh script in a text editor and specify the 
protein names you are working on by changing the two variables 
specifying the protein names: 


proteinA=infA
proteinB=aroB


Then save and close the run_cameos.sh script file. Next,
open a CLI in the main/ folder and run the script by:


$ bash run_cameos.sh


As the script is running, it will display updates on which tool is
being run and its progress in the CLI window. Once the run is
done, it will display information on where the output files are
located and how long the script took to run.


4 Notes

1. More information on these coding sequences can be seen on
the EcoCyc database [41] here:
https://ecocyc.org/gene?orgid=ECOLI&id=EG10504 and
https://biocyc.org/gene?orgid=ECOLI&id=EG10074.


2. Only E. coli sequences have been used with CAMEOS before.
Although in principle nothing prevents other prokaryote
coding sequences from being used, for any sequence that departs
from the standard codon table or E. coli codon usage, the code
would need to be adapted manually.


3. Text to enter will be supplied in single quotes ('text to be
entered'), and the quotes should not be included unless specifically
stated.


4. Depending on the number of sequences, this process may take 
more than 1 h to generate the data. 


5. The FASTA sequences could also be downloaded programma- 
tically using the available Application Programming Interface 
(API) using Python, Perl, or JavaScript. InterPro makes this 
process easier by automatically generating the code needed, but 
this method is outside the scope of the current article. 


6. The HH-suite GitHub page has a detailed wiki page with 
examples of how to run their script and is accessible via 
https://github.com/soedinglab/hh-suite/wiki. 

7. https://mafft.cbrc.jp/alignment/server/add.html.


8. Although available through Bioconductor for the R language,
we used the CLI version available from the original publication
here: http://www.bioinf.ucd.ie/download/od-seq.tar.gz.


9. Some multiple sequence aligners (e.g., MAFFT and FAMSA)
create output with 50 or 80 characters per line separated by
newline (\n) characters, which is not suitable for use in the
CAMEOS script.


Design of Gene Boolean Gates and Circuits with Convergent
Promoters


Biruck Woldai Abraha and Mario Andrea Marchisio 


Abstract 


Gene digital circuits are the subject of many research works due to their various potential applications, from
hazard detection to medical diagnostics. Moreover, a remarkable number of techniques, developed in
electronics, can be used for the construction of biological digital systems. In our previous works, we
showed how to automate the design and modeling of gene digital circuits whose gates were based on
transcription and translation regulation. In this chapter, we illustrate how Boolean gates could be implemented
by following a particular architecture, the convergent promoter one, rather widespread in nature but
seldom adopted in Synthetic Biology. Besides gate design, we also explain how to extend our previous
modeling approach, based on composable parts and pools of molecules, to quantitatively describe and
simulate this particular kind of digital biological device.


Key words Boolean gates, Digital circuits, Convergent promoters, RNA polymerase II collision 


1 Introduction


Boolean gates are commonly used in electronics, where they
represent the basic components of digital circuits, which are also at
the basis of how computers work. Logic formulae are made of
variables, referred to as literals, organized into clauses. A logic
formula corresponds to a digital circuit where the literals are the
circuit inputs and the clauses are Boolean gates. Literals are binary
variables, i.e., they can assume only two values: 0 (FALSE) and
1 (TRUE). The NOT (¬) operator switches the value of a
literal from 0 to 1 or vice versa. Clauses are connected via other
logic operators: AND (∧) and OR (∨). The output of a logic
formula/circuit is a binary variable as well.

In logic terms, a truth table is an object that explains, in a 
complete and concise way, the relationship between the inputs 
and the output of a circuit. Any truth table can be converted into 
two equivalent logic formulae that, however, lead to slightly
different circuit schemes. One formula is written in the disjunctive
normal form (DNF) or sum of products (SOP). Here, every clause
contains logic multiplication among literals, and the output of the
formula corresponds to the sum of the outputs of all the clauses.
The other way of expressing a formula is in the conjunctive
normal form (CNF), where clauses contain sums of literals and are,
in contrast, multiplied by each other (product of sums, POS). In
principle, one initially defines, through the truth table, the function 
of a digital circuit. The truth table is then converted into both SOP 
and POS formulae via, for instance, the Karnaugh map method [1], 
and then, the formula that minimizes the number of gates is imple- 
mented in the lab. 
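
To make the two normal forms concrete, the sketch below derives the canonical SOP and POS formulae directly from a truth table. It is an illustration only and does not perform Karnaugh map minimization, so the formulae it returns are generally longer than the minimized ones. The function name canonical_forms is hypothetical, and NOT, AND, and OR are written with the ASCII symbols ~, &, and |.

```python
from itertools import product


def canonical_forms(names, truth):
    """Return (SOP, POS) strings for a truth table.

    names: input literal names, e.g. ("a", "b")
    truth: dict mapping each tuple of 0/1 inputs to the 0/1 output
    """
    sop_clauses, pos_clauses = [], []
    for values in product((0, 1), repeat=len(names)):
        if truth[values]:
            # row with output 1 -> AND-clause that is true only on this row
            lits = [n if v else f"~{n}" for n, v in zip(names, values)]
            sop_clauses.append("(" + " & ".join(lits) + ")")
        else:
            # row with output 0 -> OR-clause that is false only on this row
            lits = [f"~{n}" if v else n for n, v in zip(names, values)]
            pos_clauses.append("(" + " | ".join(lits) + ")")
    sop = " | ".join(sop_clauses) if sop_clauses else "0"
    pos = " & ".join(pos_clauses) if pos_clauses else "1"
    return sop, pos
```

For a two-input XOR, the SOP form contains one clause per input combination that gives 1, and the POS form one clause per combination that gives 0. Counting the clauses in each is a first, rough indication of how many gates each scheme would need.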

In previous works from our lab, we have shown that as long as 
the number of input signals is lower than or equal to 4, the Kar- 
naugh map method can be exploited for the automatic design of 
digital synthetic gene networks [2, 3]. With respect to the circuit 
schemes derived from POS formulae, those following SOP formu- 
lae reduce the number of gates in the circuit by one unit by 
requiring that each clause produces the circuit output, for instance,
a fluorescent protein. This is the so-called distributed output
architecture [4]. Although important, in order to select a scheme 
for a digital circuit, the number of transcription units is not the only 
parameter to take into account. Boolean gates are transcriptional 
units where the logic behavior depends on control of transcription 
and/or translation (see Fig. 1 for the symbols used throughout this 
chapter). In our previous works, this control could be achieved
mainly in three ways: at the promoter level via transcription factor
proteins (TFs) and at the mRNA level via either small RNAs or
riboswitches. Since the engineering of small RNAs is much easier
than that of new proteins and, moreover, riboswitches interact
directly with chemicals, we proposed a complexity score to evaluate
the actual difficulty of implementing a circuit in the lab based on the
number of TFs and small RNAs rather than genes in the circuit
[2]. Slightly different was the criterion described in the later work
by Gander et al. [5], where, moreover, CRISPR-dSpCas9 [6] had
been employed and shown to be a powerful instrument to simplify
the structure of logic networks.

Fig. 1 Symbols used in this chapter for the design of genetic Boolean gates and circuits (symbols shown:
Antisense, Promoter, Coding Region (CDS), Reporter protein gene, Incomplete reporter protein gene,
Terminator, siRNA, dsRNA, mRNA)

In Synthetic Biology, digital circuits are widely studied and have
been engineered in different organisms because of their vast number
of applications, from biocomputing [7, 8] to biosensors
[9, 10]. They can be used for medical diagnostics [11] or even as
therapeutic devices [12].

In this chapter, we want to describe how to use convergent
promoters in order to build synthetic gene Boolean gates and basic
digital circuits. First, we are going to illustrate the molecular biology
of this particular promoter configuration and how the RNA
polymerase II collision, induced by this promoter architecture, can
be exploited to mimic logic function. Then, we will describe how to
design Boolean gates taking up to three inputs. Finally, we will
show how molecular phenomena due to convergent promoters
have been modeled in the past and propose our mathematical
description of RNA polymerase II collision within the framework
of composable parts (see Note 1) that we developed for the modular
design of synthetic gene circuits [13].


2 Methods


2.1 Convergent
Promoters


Convergent promoters represent one way in which transcriptional
interference takes place in cells [14]. As the name says, in this configuration
two promoters face each other on the DNA and share part of their
transcripts (see Fig. 2a). The different strength of the promoters
determines which gene is transcribed in higher quantity. Besides
convergent promoters, other promoter arrangements, such as tandem
and overlapping promoters, can lead to transcriptional interference
(Fig. 2b).

Different mechanisms can determine transcriptional interfer- 
ence. Promoter competition occurs when RNA polymerase, by bind- 
ing a promoter, hinders the binding of other RNA polymerase 
molecules to a second promoter nearby (this can be the case of a 
tandem or an overlapping promoter, as in Fig. 2b). Occlusion is due
to the transient, though frequent, occupancy of a promoter by
RNA polymerases coming from a nearby strong promoter. Sitting
duck refers to the situation in which RNA polymerase molecules
take too long to initiate transcription, such that they are dislodged
from the DNA by other RNA polymerases coming, once again,
from a nearby “aggressive” promoter. Roadblock can be seen,
in a sense, as the opposite phenomenon to sitting duck. In this case, RNA
polymerases are so tightly bound to the open complex formed
during initiation that RNA polymerases coming from an
opposite promoter are pushed off the DNA. It should be noted,
though, that roadblock can also take place far from a promoter if
an RNA polymerase molecule is stalling on the DNA. Finally,
collision is literally a clash between RNA polymerases elongating
in opposite directions along the DNA. As a result, either just one or
both RNA polymerases fall off the DNA and terminate transcription
(see Fig. 3). Theoretical studies suggest that the rate of collision
increases with the distance between and the activity of the convergent
promoters [15]. In this chapter, we will consider only RNA
polymerase collision as a phenomenon through which convergent
promoters make it possible to mimic logic formulae.

Fig. 2 Transcriptional interference. Among the main mechanisms that cause transcriptional interference, there
are (a) convergent promoters and (b) tandem or overlapping promoters. The net result of transcriptional
interference is a reduction in gene expression. In the figure, some parts are enclosed in blue frames to
distinguish them from the other parts that are in the opposite orientation

Convergent promoters have been already used in Synthetic 
Biology in a different context, namely, to reconstruct RNA inter- 
ference (RNAi) in S. cerevisiae. They appeared to be an efficient way 
to produce the siRNA precursor that is later processed by the Dicer
(first) and the Argonaute (later) into siRNAs (small interfering
RNAs). Upon binding the Argonaute, siRNAs give rise to the
RISC (RNA-induced silencing complex). siRNAs are designed to
bind, by base complementarity, the mRNA of a target gene that is
then cut by the Argonaute and finally degraded by the cell machinery.
It should be noted that both the Dicer and the Argonaute
genes are missing from the S. cerevisiae genome and have to be
reinserted into its chromosomes or expressed upon transformation
with either centromeric or episomal plasmids [16-18] (see Fig. 4).

Fig. 3 Possible mechanisms that trigger transcriptional interference

Fig. 4 Convergent promoters and RNA interference reconstruction in S. cerevisiae. The RNAi circuit in the figure is
organized into four transcription units (TUs), one consisting of convergent promoters flanking a fragment of a
reporter (fluorescent) protein. Convergent promoters lead to the synthesis of a long siRNA precursor that is
processed, by the Dicer, into small double-stranded RNAs. They are loaded into the Argonaute where one
strand is removed and the other permits the formation of the RISC: RNA-induced silencing complex. The siRNA
binds, by base pair complementarity, the mRNA of the fluorescent protein and puts the Argonaute in the
condition to cut the mRNA. Upon cleavage, the mRNA degradation pathway is activated. If the convergent
promoters can be induced via a chemical, cell fluorescence could be regulated through the reconstructed RNAi
pathways. Notice that, even though we are considering S. cerevisiae cells, compartments have been omitted
from the figure


2.2 Boolean Gates
Based on Convergent
Promoters


In order to describe the design of synthetic biological Boolean
circuits, we are going to use the same formalism as in our previous
works on this topic [2, 3]. Circuit inputs are chemicals that can be
divided into two categories: inducers (i) and corepressors (c)
[19]. The former activate transcription, the latter inhibit it. The
circuits we depict in this chapter are transcriptional networks.
Hence, input chemicals act on transcription factor proteins. They
are either repressors (R) or activators (A) that can lie in two
different states: active (R^a, A^a), i.e., able to bind the DNA, and
inactive (R^i, A^i), i.e., incapable of adhering to the double strand.
Thus, inducers interact with R^a (turning them into inactive ones,
R^i) and A^i (changing their configuration into the active one, A^a)
since transcription can take place when repressors are detached
from the DNA or activators are anchored to it. Following the
same logic, corepressors dock either to R^i (making them active,
R^a) or A^a (turning them into their inactive form, A^i). Therefore,
corepressors favor the binding of repressors and hinder the arrival
of activators at the target promoters. Overall, the admitted reactions
among the circuit inputs and transcription factors (Eq. 1) are
those just described: an inducer converts R^a into R^i or A^i into
A^a, and a corepressor converts R^i into R^a or A^a into A^i.

Fig. 5 NOT gates. Inducers make it possible to realize the NOT logic operation when acting on an antisense promoter.
As in every other figure throughout the chapter, green arrows represent transcription activation, whereas
hammer-like red arrows stand for repression of transcription

Basic logic operations, i.e., YES (buffer gate in electronics) and
NOT, are realized by joining two transcription units, one of which
can be based on convergent promoters. They make use of the
reactions in Eq. 1 and are illustrated in Figs. 5 and 6.

With an inducer as the input, convergent promoters are required to
realize a NOT gate. As shown in Fig. 5, we suppose that the two
promoters share the same CDS. The “sense” promoter drives the
correct mRNA transcription and leads to the synthesis of a functional
protein. The “antisense” promoter, in contrast, would not
produce anything useful to the cell. Here, the antisense promoter is
inducible. Hence, in the absence of the inducer, it is either switched
off by the active repressor $R^a$ (upper right panel) or simply incapable
of synthesizing mRNA because it is not bound by the
corresponding activator, whose wild-type state is inactive ($A^i$,
lower right panel). As soon as the antisense promoter gets activated,
RNA polymerase II binds to it and starts elongating along the
Fig. 6 YES gates. Corepressor molecules realize a “buffer gate” when directed to an antisense promoter


coding region. Hence, a collision with RNA polymerase II molecules
coming from the sense promoter (constitutively active) takes
place and decreases protein synthesis. A NOT logic behavior can be
mimicked properly by tuning the strength of the antisense promoter,
i.e., by maximizing the collision rate between RNA
polymerase II molecules.

By using a corepressor molecule, in contrast, a convergent
promoter is required to reproduce the YES behavior (Fig. 6). The
antisense promoter is either activated by $A^a$ (upper left panel) or
repressed by $R^i$ (a repressor that is inactive in its ground state, lower
left panel). Hence, in the absence of $c$, transcription takes place
from the antisense promoter, and the output (a protein) is not 
produced due to RNA polymerase collision, provided that the 
antisense promoter is strong enough. If the corepressor c is added 
to the system, either the activator becomes inactive or the repressor 
gets active. In both cases, the antisense promoter transcription rate 
should be reduced to the leakage level with a consequent increase in 
protein production. 
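
The NOT and YES behaviors described above can be summarized with a small Boolean sketch. It assumes ideal parts: the sense promoter is constitutively active, the antisense promoter is either fully on or fully off, and any antisense firing abolishes the output through polymerase collision. The function names and the system labels (`"i+Ra"`, `"c+Aa"`, and so on) are illustrative and not part of the original text.

```python
# Idealized single-input gates built on a convergent promoter pair.
# Assumptions (illustrative only): Boolean signals, a constitutive sense
# promoter, and complete silencing whenever the antisense promoter fires.

def antisense_on(system: str, chemical: bool) -> bool:
    """Return True if the regulated antisense promoter fires."""
    if system == "i+Ra":   # inducer inactivates an active repressor
        return chemical
    if system == "i+Ai":   # inducer activates an inactive activator
        return chemical
    if system == "c+Aa":   # corepressor inactivates an active activator
        return not chemical
    if system == "c+Ri":   # corepressor activates an inactive repressor
        return not chemical
    raise ValueError(f"unknown system: {system}")

def output(system: str, chemical: bool) -> bool:
    """Protein is made only if the sense promoter fires unopposed."""
    sense_on = True  # constitutive sense promoter
    return sense_on and not antisense_on(system, chemical)

for system in ("i+Ra", "i+Ai", "c+Aa", "c+Ri"):
    table = [output(system, x) for x in (False, True)]
    gate = "NOT" if table == [True, False] else "YES"
    print(system, table, gate)
```

Under these assumptions the two inducer systems behave as NOT gates and the two corepressor systems as YES gates, whatever the transcription factor involved.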

These are the ways convergent promoters can be employed to
represent a literal and its negation. Figures 5 and 6 also show the
alternative design based on “standard” transcription units containing
a single sense promoter. By combining these functionalities
properly, we will illustrate how to construct, with convergent
promoters, basic Boolean gates accepting more than a single input.


2.2.1 Two-Input Boolean Gates


As we have seen in the previous section, the architecture of a YES or
a NOT gate depends only on the kind of input chemical and the
promoter (sense or antisense) on which the chemical acts. The
nature of the transcription factor associated with the chemical does
not play any particular role. Moreover, a convergent promoter is a
multiplicative architecture that produces an output equal to 1 only
when the sense promoter is activated and the antisense promoter is
inhibited. In all the other possible cases, the result is equal to 0.
In other words, convergent promoters permit the design of
multiplicative gates.

If we want to realize the AND gate, $a \wedge b$, we need a
positive literal on both promoters. Hence, according to Table 1, we
need an inducer targeting the sense promoter and a corepressor
targeting the antisense one. Each chemical can be associated with two
different transcription factors. Thus, we have overall four possible schemes
for an AND gate (see Fig. 7).

Similarly, we can design four circuits representing different
configurations of the N-IMPLY gates (both $\bar{a} \wedge b$ and $a \wedge \bar{b}$) and
the NOR gate ($\bar{a} \wedge \bar{b} = \overline{a \vee b}$; the latter equality is derived by
applying one of De Morgan’s laws). In Fig. 8, an implementation
of these two kinds of gates is sketched. Interestingly, an
N-IMPLY gate requires two molecules of the same type,
i.e., either two inducers or two corepressors. It should also be
noted that the NOR gate (together with the NAND one) is of
particular importance since it is a universal gate, i.e., it allows the
construction of every possible digital circuit, no matter its
complexity.
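
The assignment rules above can be checked by enumerating the four ways of placing one chemical on each promoter. The sketch below assumes ideal Boolean behavior (an inducer switches its promoter on, a corepressor switches it off, and the output is 1 only when the sense promoter fires and the antisense promoter is silent). The names `promoter_on`, `gate_table`, and `classify`, and the input labels x (sense) and y (antisense), are illustrative.

```python
from itertools import product

def promoter_on(kind: str, present: bool) -> bool:
    """An inducer switches its promoter on; a corepressor switches it off."""
    if kind == "inducer":
        return present
    if kind == "corepressor":
        return not present
    raise ValueError(f"unknown chemical kind: {kind}")

def gate_table(sense_kind: str, antisense_kind: str) -> tuple:
    """Truth table for inputs (x, y) in the order 00, 01, 10, 11.

    x is the chemical acting on the sense promoter, y the one acting
    on the antisense promoter.
    """
    return tuple(
        int(promoter_on(sense_kind, x) and not promoter_on(antisense_kind, y))
        for x, y in product((False, True), repeat=2)
    )

NAMES = {
    (0, 0, 0, 1): "AND",
    (0, 0, 1, 0): "N-IMPLY (x AND NOT y)",
    (0, 1, 0, 0): "N-IMPLY (NOT x AND y)",
    (1, 0, 0, 0): "NOR",
}

def classify(sense_kind: str, antisense_kind: str) -> str:
    return NAMES[gate_table(sense_kind, antisense_kind)]

for s_kind, a_kind in product(("inducer", "corepressor"), repeat=2):
    print(f"sense: {s_kind:11s} antisense: {a_kind:11s} -> {classify(s_kind, a_kind)}")
```

The enumeration reproduces the statements in the text: AND needs an inducer on the sense promoter and a corepressor on the antisense one, N-IMPLY needs two chemicals of the same kind, and NOR needs a corepressor on the sense promoter and an inducer on the antisense one.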

Other two-input gates correspond to more complex formulae.
In this case, we can still make use of convergent promoters by
using the so-called distributed output architecture that, as mentioned
above, is a realization of the DNF (disjunctive normal form)
or SOP (sum of products) circuit representation. An XOR gate, for
instance, returns 0 when both inputs are identical and 1 when they
are different. It can be expressed as $(a \wedge \bar{b}) \vee (\bar{a} \wedge b)$. The OR


Table 1
Assigning input chemicals (and the corresponding transcription factors) to the sense and antisense
promoter in order to have positive (YES) or negative (NOT) literals

System         YES          NOT
$i + R^a$      Sense        Antisense
$i + A^i$      Sense        Antisense
$c + R^i$      Antisense    Sense
$c + A^a$      Antisense    Sense
Fig. 7 Four possible implementations of an AND gate based on convergent promoters. The columns
named “pol >” and “< pol” refer to RNA polymerase II elongating from the sense and the antisense promoter,
respectively. The symbols “>” and “<” indicate the direction in which RNA polymerase II flows. “X,” in
contrast, means that RNA polymerase II cannot start elongation from the corresponding promoter


operation between the two N-IMPLY clauses is obtained by requiring
that each multiplicative gate produces the same molecule, i.e., a
fluorescent protein, which is the circuit output (see Fig. 9).

A NAND gate can also be designed by means of the distributed
output architecture since the formula $\overline{a \wedge b}$ becomes, via
De Morgan’s law, $\bar{a} \vee \bar{b}$. However, since we have two negated literals,
if we want to make use of convergent promoters only, both $a$ and
$b$ must be inducers, as shown in Fig. 10.
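
The distributed output architecture can be sketched as the logical OR of two convergent-promoter units that express the same reporter. The code below is an idealized Boolean illustration. For the XOR gate it assumes that both inputs are inducers (one acting on the sense promoter and the other on the antisense promoter of each unit), which is one of several possible choices. The helper names are hypothetical.

```python
from itertools import product

def convergent_unit(sense_on: bool, antisense_on: bool) -> bool:
    """Multiplicative unit: output only if sense fires and antisense is silent."""
    return sense_on and not antisense_on

def xor_gate(a: bool, b: bool) -> bool:
    # Assumption: a and b are both inducers.
    unit_a = convergent_unit(sense_on=a, antisense_on=b)  # a AND NOT b
    unit_b = convergent_unit(sense_on=b, antisense_on=a)  # NOT a AND b
    return unit_a or unit_b  # both units produce the same reporter

def nand_gate(a: bool, b: bool) -> bool:
    # Each unit has a constitutive sense promoter and an antisense
    # promoter switched on by an inducer (a negated literal).
    unit_a = convergent_unit(sense_on=True, antisense_on=a)  # NOT a
    unit_b = convergent_unit(sense_on=True, antisense_on=b)  # NOT b
    return unit_a or unit_b

for a, b in product((False, True), repeat=2):
    print(int(a), int(b), "XOR:", int(xor_gate(a, b)), "NAND:", int(nand_gate(a, b)))
```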

Potentially, by assigning a single literal (input) to each promoter
and making use, when necessary, of the distributed output
architecture, every two-input Boolean gate can be designed via
convergent promoters. However, not every configuration is possible.
As we have seen, N-IMPLY gates demand inputs of the same
kind, whereas AND gates can be constructed only when one input
is an inducer and the other a corepressor. Obviously, more solutions
can be achieved by mixing gates based on convergent promoters
and traditional transcription units (promoter-CDS-terminator). In
particular, complex logic formulae would arise by combining NOR
gates as in Fig. 8 with classical inverters (NOT gates).
Fig. 8 Possible schemes of an N-IMPLY and a NOR gate


2.2.2 Three-Input Boolean Gates


Two-input gates based on convergent promoters are designed by
taking into account only the kind of signals acting on the two
promoters. In contrast, “traditional” two-input Boolean gates,
containing the sense promoter only, are built by placing two operators,
i.e., the sequences where transcription factors bind, along the
(sense) promoter. Hence, the logic behavior strongly depends on
the interaction between the DNA and the transcription factors
[20]. For instance, two inducers $i_1$ and $i_2$ acting on two active
repressors, $R_1^a$ and $R_2^a$, give rise to an AND gate. In contrast, if
they act on two inactive activators, $A_1^i$ and $A_2^i$, they produce an OR
gate unless a strong cooperativity is required for the simultaneous
binding of the two proteins to the DNA: in this improbable case,
the logic function would be an AND gate [21].

If we want to use convergent promoters to build more complex
logic gates, accepting, for instance, three inputs, then the interactions
among the transcription factors binding the same promoter
cannot be neglected. As we have mentioned in the Introduction,
there are many different possible ways to realize Boolean gates.
Here, we consider only pure transcriptional gates, where a logic
Fig. 9 An XOR gate based on the distributed output architecture between two transcription units containing
convergent promoters. The two clauses, and the corresponding convergent promoters, have been named A and B to
fully illustrate, in the truth table, the working of this rather complex design


function arises from the action of transcription factors on promoters.
Since each input chemical is associated with a different protein,
we can neglect any kind of hetero-cooperativity (which is quite
rare). Hence, two repressors switch off transcription independently
of each other. Similarly, two activators turn on protein synthesis
separately. Finally, if an activator and a repressor bind the same
promoter simultaneously, the repressor wins.

Before starting the description of three-input Boolean gates, it
is worth noting that, inside a complex logic network, the inputs of a
gate are transcription factors that do not interact with any chemicals.
Previously, we have pointed out that a promoter regulated by
two active repressors, $R_1^a$ and $R_2^a$, behaves as an AND gate if the
two repressors are inactivated by different chemicals, $i_1$ and $i_2$. In
contrast, if no input signal is able to inhibit $R_1^a$ and $R_2^a$, then a
transcription unit with this configuration would become a NOR
gate, able to drive protein synthesis only in the absence of the two
inputs, the repressors in this case.
Fig. 10 A NAND gate ($\overline{a \wedge b} = \bar{a} \vee \bar{b}$) made of two separate convergent-promoter devices. This design is possible only if the
inputs are both inducers


The key point in building three-input gates based on convergent
promoters is the multiplicative logic interaction (AND)
between the sense and antisense promoter. Suppose we send
two inputs to the sense promoter and one to the antisense promoter.
If we want to implement a three-input AND gate, then we
have to combine a two-input AND gate on the sense promoter with
a positive literal (YES gate) on the antisense one. The AND gate on
the sense promoter can be realized in the way we just mentioned
above: two inducers inhibiting their corresponding repressors. As
for the YES gate on the antisense promoter, we cannot obtain it
with a third inducer molecule, but we need a corepressor acting, for
instance, on an inactive repressor protein. The overall scheme is
shown in Fig. 11.

An alternative design would require an inducer acting on the
sense promoter and two corepressors binding active activators
controlling the antisense promoter (see Fig. 12). It should be
noted that, by substituting an active activator with an inactive
repressor, the overall three-input device would no longer behave
Fig. 11 A three-input AND gate ($i_1 \wedge i_2 \wedge c_3$). The AND behavior between $i_1$ and $i_2$ is realized easily on the sense promoter
and then coupled with a corepressor, $c_3$, that carries out a YES function on the antisense promoter. Essential
for the behavior of the whole gate is the multiplicative relation between the two promoters


as an AND gate, as shown in Fig. 13. Other gates of different
complexity can, in principle, be designed in the same way.
Moreover, considering the synthetic gene digital
circuits in the literature, it is not advisable to regulate promoters
with more than two inputs.
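
The three-input designs can be checked with the promoter rules stated earlier in this section (repressors act independently, activators act independently, and a bound repressor wins over an activator). The sketch below is an idealized Boolean illustration. The function names are hypothetical, the inducer on the sense promoter of the second and third designs is assumed to act on an active repressor, and the third function models the activator-to-repressor substitution only under these rules.

```python
from itertools import product

def promoter_on(active_activators, active_repressors) -> bool:
    """Promoter state under the rules stated in the text.

    - any active (DNA-bound) repressor switches the promoter off;
    - if the promoter is activator-dependent, at least one activator
      must be active;
    - a promoter without activators is treated as constitutive.
    """
    if any(active_repressors):
        return False
    if active_activators:  # non-empty list: activator-dependent promoter
        return any(active_activators)
    return True

def and_two_inducers_one_corepressor(i1, i2, c3):
    # Sense: two active repressors, each relieved by its inducer.
    sense = promoter_on([], [not i1, not i2])
    # Antisense: inactive repressor that becomes active with c3.
    antisense = promoter_on([], [c3])
    return sense and not antisense

def and_one_inducer_two_corepressors(i1, c2, c3):
    sense = promoter_on([], [not i1])
    # Antisense: two active activators, each inactivated by a corepressor.
    antisense = promoter_on([not c2, not c3], [])
    return sense and not antisense

def variant_with_inactive_repressor(i1, c2, c3):
    sense = promoter_on([], [not i1])
    # One active activator (c2) and one inactive repressor (c3).
    antisense = promoter_on([not c2], [c3])
    return sense and not antisense

for i1, x, y in product((False, True), repeat=3):
    print(int(i1), int(x), int(y),
          int(and_two_inducers_one_corepressor(i1, x, y)),
          int(and_one_inducer_two_corepressors(i1, x, y)),
          int(variant_with_inactive_repressor(i1, x, y)))
```

Within this sketch the first two designs return 1 only when all three inputs are present, whereas the variant evaluates to $i_1 \wedge (c_2 \vee c_3)$, so the AND behavior is lost.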

In general, the convergent promoter architecture offers inter- 
esting novel solutions for Boolean gate implementation. However, 
as mentioned above, it might be too complex to build an entire 
digital circuit with gates containing only convergent promoters. 
Hybrid gates, mixing transcription and translation regulation, 
Fig. 12 Alternative design for a three-input AND gate ($i_1 \wedge c_2 \wedge c_3$). Unlike in Fig. 11, the sense promoter hosts a YES
gate combined with a two-input AND gate on the antisense promoter. In this case too, the three input signals
are not all of the same type. Moreover, it is essential that the corepressors act on active activators


might give rise to new interesting and useful architectures. Furthermore,
different gate configurations could be combined within cell
consortia [22] to make the implementation of intricate
genetic networks more feasible. In general, many schemes can be designed
to represent the same digital functions. However, methods for
accurate performance prediction and evaluation of construction
complexity are not fully established yet.
Fig. 13 Role of the transcription factors in determining a logic behavior. With respect to the AND gate in
Fig. 12, we exchange an active activator with an inactive repressor. This modification was enough to completely spoil the AND gate by giving rise
to a different logic formula


2.3 Modeling Transcription Interference

In the previous section, we have described possible rules for
designing gene Boolean gates based on convergent promoters.
Their working depends on a particular kind of transcriptional
interference, i.e., RNA polymerase (II) collision. Clearly, since we cared
about architectural aspects only, we assumed that RNA polymerase
II collision was always highly effective such that RNA polymerase II
molecules “fired” by the antisense promoter were able to stop
transcription completely after clashing with those coming from
the sense promoter. In reality, it is not clear how to predict the
result of such a collision. Indeed, the two RNA polymerase II
molecules do not necessarily both fall off the DNA, and, depending on several
factors such as the relative strength of the two promoters and the
distance between them, the probability that one of the two RNA
polymerases is unaffected by the collision and goes on synthesizing
mRNA might even be rather high.


2.3.1 The Model by Sneppen and Co-authors

One of the first detailed mathematical models of transcriptional
interference in E. coli was presented by Sneppen et al. [15]. Even
though this framework relies on experimental results from bacteria
and, as we will see, is probably not the most appropriate for a
theoretical representation of our Boolean gates, it is interesting to
see which kinds of parameters and “events” have been taken into
account to explain transcriptional interference.

Initially, the strengths of the sense and the antisense promoters
were evaluated separately. As for the antisense promoter (termed, in
the paper, the sensitive promoter but referred to here as $p_{as}$), RNA
polymerase (RNAp) is supposed to give rise to a sitting duck
complex (SDC, which is related to the low transcription initiation
rate from $p_{as}$) via an irreversible reaction with rate-constant $k_{on}^{as}$

$$\mathrm{RNAp} + p_{as} \xrightarrow{k_{on}^{as}} \mathrm{SDC} \qquad (2)$$


It should be noted that this is a different reaction from the
usual binding/unbinding of RNA polymerase to the promoter,
which is indeed a reversible reaction. The sitting duck complex is
then described to fire an “elongation complex” EC (see Note 2)
with firing rate-constant $k_f^{as}$

$$\mathrm{SDC} \xrightarrow{k_f^{as}} \mathrm{EC} \qquad (3)$$


The steady state of a single, isolated antisense promoter is characterized
by a balance between the formation of the sitting duck
and the elongation complex. This leads to the definition of a new
quantity, the promoter fraction occupancy

$$\theta^{as} = \frac{k_{on}^{as}}{k_{on}^{as} + k_f^{as}} \qquad (4)$$

We can estimate the strength of the antisense promoter ($K^{as}$,
i.e., the equivalent of the transcription initiation rate) by multiplying
the firing rate-constant by the fraction occupancy we just
calculated. This turns out to be equal to

$$K^{as} = k_f^{as}\,\theta^{as} = \frac{k_f^{as} k_{on}^{as}}{k_{on}^{as} + k_f^{as}} \qquad (5)$$
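
As a quick numerical illustration of the occupancy and strength of an isolated antisense promoter, the sketch below evaluates the two quantities for arbitrary rate constants. The function names and the numbers are illustrative and are not taken from the original work.

```python
def occupancy(k_on: float, k_f: float) -> float:
    """Steady-state fraction of time the promoter holds a sitting duck complex."""
    return k_on / (k_on + k_f)

def strength(k_on: float, k_f: float) -> float:
    """Promoter strength: firing rate-constant times fraction occupancy."""
    return k_f * occupancy(k_on, k_f)

# Arbitrary example values (same units of 1/time for both constants).
for k_on, k_f in [(1.0, 1.0), (0.1, 10.0), (10.0, 0.1)]:
    print(f"k_on={k_on:5.1f} k_f={k_f:5.1f} "
          f"theta={occupancy(k_on, k_f):.3f} K_as={strength(k_on, k_f):.3f}")
```

The strength is limited by the slower of the two steps: when firing is much faster than binding it approaches the binding rate-constant, and vice versa.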


As for the sense, “aggressive,” promoter, $p_{sn}$, the transcription
initiation rate, $K^{sn}$, is not associated with any particular formula.
The average front-to-front time between two successive RNAps
fired by $p_{sn}$ corresponds to the inverse of $K^{sn}$ ($1/K^{sn}$). Sneppen
et al. pointed out that $K^{sn}$ depends on what they called self-occlusion
time, which actually corresponds to the clearance time, i.e., the
time RNAp takes to leave the active site and move from the promoter
to the nearby DNA sequence (e.g., a ribosome binding site,
in bacteria). The clearance time ($t_{cl}$) is calculated as the ratio
between the length ($l$) and the speed ($v$) of the RNAp without
the sigma factor (i.e., the $\alpha_2\beta\beta'$ complex). Therefore, we have that

$$K^{sn} \le \frac{1}{t_{cl}} = \frac{v}{l} \qquad (6)$$

What the authors consider as the most important quantity to
characterize the sense promoter for studying transcriptional interference
is the “gap time” ($t_g$), i.e., the average time between the
back and the front of two consecutive RNAp molecules fired by $p_{sn}$
(in other words, the time taken by two successive front RNAps to
reach $p_{as}$). This corresponds to the time traveled by the first RNAp
($1/K^{sn}$) minus the clearance time associated with the following
RNAp ($l/v$):

$$t_g = \frac{1}{K^{sn}} - \frac{l}{v} = \frac{v - K^{sn} l}{K^{sn} v} \qquad (7)$$

A new rate-constant, termed $K_g^{sn}$, is defined as the inverse of $t_g$

$$K_g^{sn} = \frac{1}{t_g} = \frac{K^{sn} v}{v - K^{sn} l} \qquad (8)$$

Using numerical values estimated from experiments on E. coli
cells (not relevant for the analysis in this chapter), it is possible to
show that, for weak promoters, it holds that $K_g^{sn} \approx K^{sn}$.


2.3.2 Occlusion

In the model by Sneppen et al., three kinds of transcriptional
interference are taken into account (occlusion, sitting duck, and
collision) and appear to be strongly intertwined. Promoter occlusion
interferes with the formation of the SDC at the antisense
promoter. RNA polymerases fired by the sense promoter occlude
the antisense promoter during their elongation on the DNA.
Therefore, in order to have an SDC at $p_{as}$, the gap time $t_g = 1/K_g^{sn}$
should be long enough to guarantee RNAp binding at $p_{as}$. The
probability ($\chi$) that $p_{as}$ is free from ECs originated by $p_{sn}$ is proportional
to the ratio between the gap time and the “total” time ($t_t$)
traveled by the RNAp fired by $p_{sn}$ ($t_t = 1/K^{sn}$) multiplied by the
probability $P_g$ that the actual gap time (i.e., the gap time to
the next arriving RNAp) is long enough to permit the binding of a
full RNAp ($\sigma\alpha_2\beta\beta'$).

P, is quantified as 


1 2 eu 
Ps == =p eS pe (9) 
Overall, we can write that 


$$\chi = \frac{t_g}{t_t} P_g = \left(1 - \frac{K^{sn} l}{v}\right) P_g \qquad (10)$$

As every probability, $\chi \in [0,1]$. In particular, $\chi = 0$ is verified
when $K_g^{sn}$ approaches infinity, i.e., the gap time tends to zero and
the SDC cannot be formed (complete occlusion). In contrast, $\chi = 1$
when $K_g^{sn} = 0$ such that the gap time approaches infinity and,
therefore, there is no occlusion.

Equation 10 leads to a redefinition of the rate-constant in Eq. 2
from $k_{on}^{as}$ to $\chi k_{on}^{as}$. Hence, transcriptional interference due to occlusion
changes the antisense promoter strength $K^{as}$ in Eq. 5 into

$$K_{OCC}^{as} = \frac{k_f^{as}\, \chi k_{on}^{as}}{k_f^{as} + \chi k_{on}^{as}} \qquad (11)$$

Finally, the transcriptional interference due to occlusion, $I_{OCC}$,
is quantified as the ratio between $K^{as}$ and $K_{OCC}^{as}$

$$I_{OCC} = \frac{K^{as}}{K_{OCC}^{as}} = \frac{k_f^{as} + \chi k_{on}^{as}}{\chi \left(k_{on}^{as} + k_f^{as}\right)} \qquad (12)$$

From Eq. 12, we can see that the minimal value of $I_{OCC}$ is
1. This corresponds to the theoretical case of $\chi = 1$, i.e., the absence
of occlusion.


2.3.3 Sitting Duck

Transcriptional interference takes place, mainly, at the antisense
promoter that is considered, in this framework, as much weaker
than the sense one. Whereas, in the occlusion-based interference, RNA
polymerase coming from $p_{sn}$ hinders the binding of RNAp at
$p_{as}$ and prevents the formation of a sitting duck complex, in the sitting
duck interference RNA polymerase fired by the sense
promoter removes an SDC that had been assembled at the antisense
promoter. Moreover, Sneppen et al. made the assumption that,
whenever such a clash takes place, the SDC is always destroyed.
Furthermore, they reckoned that the time gap, as estimated
above, is probably shorter if we consider that the formation of the
transcription initiation complex at $p_{sn}$ and the subsequent firing of
an RNAp are much faster reactions than the SDC formation.
Therefore, they introduced the rate-constant $K_{g'}^{sn} > K_g^{sn}$ as a correction
to the previous system description, which shall be included
in the definition of promoter fraction occupancy. Hence, Eq. 4 is
rewritten as

$$\theta^{as} = \frac{\chi k_{on}^{as}}{\chi k_{on}^{as} + k_f^{as} + K_{g'}^{sn}} \qquad (13)$$
where the probability $\chi$ has also been considered. Under this
hypothesis, the strength of the antisense promoter becomes, due
to the presence of both occlusion and sitting duck interference,

$$K_{OCC+SD}^{as} = \frac{k_f^{as}\, \chi k_{on}^{as}}{\chi k_{on}^{as} + k_f^{as} + K_{g'}^{sn}} \qquad (14)$$

The overall transcriptional interference due to occlusion and
sitting duck is then given by

$$I_{OCC+SD} = \frac{K^{as}}{K_{OCC+SD}^{as}} = \frac{\chi k_{on}^{as} + k_f^{as} + K_{g'}^{sn}}{\chi \left(k_{on}^{as} + k_f^{as}\right)} \qquad (15)$$

If we set $\chi = 1$, i.e., no occlusion, we can quantify the interference
due to the sitting duck alone

$$I_{SD} = \frac{K^{as}}{K_{SD}^{as}} = 1 + \frac{K_{g'}^{sn}}{k_{on}^{as} + k_f^{as}} \qquad (16)$$

i.e., it grows linearly with the inverse of the “corrected” gap time.


2.3.4 Collision

RNA polymerase (II) collision is the most important kind of transcriptional
interference in the design of our Boolean gates. At least
to understand how a Boolean gate would work, we supposed that,
in case of clash, the two RNA polymerases would fall from the
DNA. Sneppen et al. gave an estimation of the probability $P_c$ that
an RNAp fired by the antisense promoter reaches the sense promoter
by escaping a collision with another RNAp fired by the sense
promoter. They observed that $P_c$ shall depend on three main factors:
(1) the time $t_c = d/v$ (where $d$ is the distance between the two
promoters and $v$ the speed of RNAp) that the EC from $p_{as}$ takes to
arrive at $p_{sn}$; (2) the fact that no RNAp shall be fired by $p_{sn}$ during $t_c$;
and (3) the corrected gap time $t_{g'}$. Overall, $P_c$ can be expressed as

$$P_c = e^{-t_c / t_{g'}} = e^{-K_{g'}^{sn} t_c} \qquad (17)$$

Hence, a collision is avoided in just two hypothetical cases:
when either $K_{g'}^{sn} = 0$ (no RNAp is fired from the sense promoter)
or $t_c = 0$, i.e., the distance between the two promoters is null.

By taking into account the probability of escaping a collision,
the “total” strength of the antisense promoter $K_T^{as}$ becomes

$$K_T^{as} = K_{OCC+SD}^{as}\, e^{-K_{g'}^{sn} t_c} \qquad (18)$$

and the total transcriptional interference ($I_T$) in a convergent promoter
system is calculated as

$$I_T = \frac{K^{as}}{K_T^{as}} = I_{OCC+SD}\, e^{K_{g'}^{sn} t_c} \qquad (19)$$

Sneppen et al. pointed out, however, that the equations derived
above get increasingly inaccurate if the strengths of the two promoters
are similar and the distance between the two promoters
becomes large. They underlined that collision gives a substantial
contribution to transcriptional interference when $p_{sn}$ is very strong
and the distance from $p_{as}$ is over 200 nt (this, at least, in E. coli).
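
The chain of corrections in this section (occlusion, sitting duck, collision) can be collected in a short numerical sketch. The code below is an illustration under stated assumptions, not the authors' implementation: the occlusion probability `chi` is passed in as a number in (0, 1] rather than computed, the collision-escape probability is taken as a simple exponential in the corrected gap rate times the travel time d/v, and all function and parameter names are hypothetical.

```python
import math

def gap_rate(K_sn: float, l: float, v: float) -> float:
    """Inverse of the gap time t_g = 1/K_sn - l/v."""
    t_g = 1.0 / K_sn - l / v
    if t_g <= 0:
        raise ValueError("K_sn must stay below v / l")
    return 1.0 / t_g

def interference(k_on, k_f, chi, K_g_corr, d, v):
    """Interference factors for the antisense promoter.

    k_on, k_f : binding and firing rate-constants of the antisense promoter
    chi       : probability that the antisense promoter is not occluded (0 < chi <= 1)
    K_g_corr  : corrected gap rate-constant of the sense promoter
    d, v      : promoter distance and RNAp speed (t_c = d / v)
    """
    K_as = k_f * k_on / (k_on + k_f)                             # isolated promoter
    K_occ = k_f * chi * k_on / (k_f + chi * k_on)                # occlusion only
    K_occ_sd = k_f * chi * k_on / (chi * k_on + k_f + K_g_corr)  # + sitting duck
    P_c = math.exp(-K_g_corr * d / v)                            # assumed escape probability
    K_T = K_occ_sd * P_c                                         # + collision
    return {
        "I_occ": K_as / K_occ,
        "I_occ_sd": K_as / K_occ_sd,
        "I_total": K_as / K_T,
        "P_c": P_c,
    }

# Arbitrary illustrative numbers.
print(gap_rate(K_sn=1.0, l=1.0, v=2.0))
print(interference(k_on=1.0, k_f=1.0, chi=0.5, K_g_corr=0.01, d=100.0, v=50.0))
```

With `chi = 1` and a vanishing gap rate all three interference factors reduce to 1, i.e., the antisense promoter behaves as if it were isolated.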


2.4 Modeling and Constructing Logic Gates via Transcriptional Interference

The modeling framework described above was extended by Bordoy
and Chatterjee [23], who combined antisense transcription with
antisense regulation, i.e., phenomena such as inhibition or attenuation
of translation and mRNA degradation. In a more recent work,
Bordoy et al. [24] applied transcriptional interference to the construction
of bacterial Boolean gates taking up to two inputs (AND
and OR). They focused on only two kinds of transcriptional interference,
i.e., those due to roadblock or tandem promoters. Moreover,
they analyzed their circuits via a different approach from
that in [23], based on the Shea-Ackers method [25],
already applied in the modeling of synthetic gene networks. Even
though their Boolean gates do not contemplate RNA polymerase
collision, it is interesting to see some conclusions that arose from
the experimental and theoretical results presented in this work.
An AND gate was designed by means of the roadblock mecha- 
nism by placing a lac operator (lacOp) downstream of a 
tetracycline-inducible promoter containing two tet operators 
(tetOp, see Fig. 14). In principle, only in the presence of both 


Fig. 14 An AND gate based on roadblock. (a) Structure of the TU representing the AND gate, regulated by roadblock (tetOps and lacOp). (b) Roadblock
process that takes place on the transcription unit in (a) in the presence of tetracycline (tet) and the absence of IPTG. Here, RNA polymerase collides with LacI and falls off
the promoter
tetracycline (which docks to TetR and prevents it from binding any
of the two tetOps) and IPTG (which interacts with LacI and inhibits
its capability of binding lacOp), the synthesis of a fluorescent
protein takes place. In contrast, in the presence of tetracycline and
the absence of IPTG, RNAp can bind the promoter and start
transcription, which is then aborted due to the clash with LacI
bound to the DNA. Interestingly, however, a proper AND gate
behavior depends on many parameters (e.g., the distance of lacOp
from tetOp, the dissociation rate of LacI) that need to be fine-tuned
in order to get the desired logic behavior, as previously
pointed out in [26]. To this aim, a mathematical model that allows
parameter optimization is needed. In order to simulate
the dynamics of this system, it is necessary to know the fraction of
the two free proteins, i.e., not bound to the corresponding chemicals
and, therefore, able to bind the DNA. Since TetR and LacI
behave in the same way with respect to their inducer molecules and
the DNA, let us consider a generic protein P that can exist in two
states: active ($P^a$), i.e., free from any inducers and, thus, able to bind
the DNA, and inactive ($P^i$), i.e., bound to $n$ inducers ($i$) and no
longer able to anchor to the DNA. We suppose the system to be at steady
state such that the total concentration of P ($P_T$) does not change,
i.e., $P_T = P^a + P^i$.

Let us consider that there is complete cooperativity among the
inducers such that we can write

$$n\,i + P^a \xrightarrow{k_1} P^i \qquad (20)$$

where $k_1$ is the association rate-constant between inducers and
proteins. The reaction is reversible; hence, we can also write that

$$P^i \xrightarrow{k_{-1}} n\,i + P^a \qquad (21)$$

where $k_{-1}$ is the dissociation rate-constant of the $P^i$ complex.
The dynamics of $P^a$ is obtained by solving the following ordinary
differential equation

$$\frac{dP^a}{dt} = -k_1 i^n P^a + k_{-1} P^i \qquad (22)$$

If we use the steady-state condition (which implies the conservation
of $P_T$), then we have that

$$0 = -k_1 i^n P^a + k_{-1} \left(P_T - P^a\right) \qquad (23)$$


After a few algebraic steps, Eq. 23 can be rewritten as

$$f_P^a = \frac{P^a}{P_T} = \frac{1}{1 + \left(\frac{i}{K_{H_i}}\right)^n} \qquad (24)$$
which gives the fraction of protein P molecules that are active ($f_P^a$), i.e.,
not associated with the chemical $i$ and, therefore, able to bind the
DNA. Eq. 24 is a Hill function where $K_{H_i}$ is the dissociation
constant between $i$ and P at the equilibrium. In contrast, the
fraction of proteins that are bound to $i$ ($f_P^i$) and cannot get access
to the DNA is given by

$$f_P^i = 1 - f_P^a = \frac{\left(\frac{i}{K_{H_i}}\right)^n}{1 + \left(\frac{i}{K_{H_i}}\right)^n} \qquad (25)$$

Therefore, the status of protein P with respect to the inducer
$i$ is described by the Hill functions in Eqs. 24 and 25. They hold for
both TetR (together with tetracycline) and LacI (IPTG).
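
The active and inactive fractions can be evaluated numerically as follows. The sketch assumes the Hill form described above, with the dissociation constant related to the rate-constants by $K_{H_i}^n = k_{-1}/k_1$ (this relation follows from setting the time derivative to zero). Function names and numbers are illustrative.

```python
def fraction_active(i: float, K_H: float, n: float) -> float:
    """Fraction of protein not bound to the inducer (able to bind DNA)."""
    return 1.0 / (1.0 + (i / K_H) ** n)

def fraction_inactive(i: float, K_H: float, n: float) -> float:
    """Fraction of protein bound to the inducer."""
    return 1.0 - fraction_active(i, K_H, n)

def steady_state_active(i: float, k1: float, k_minus1: float, n: float,
                        P_T: float = 1.0) -> float:
    """Active protein from 0 = -k1 * i**n * Pa + k_minus1 * (P_T - Pa)."""
    return k_minus1 * P_T / (k1 * i ** n + k_minus1)

# Consistency check with arbitrary numbers: K_H = (k_minus1 / k1) ** (1 / n).
k1, k_minus1, n, i = 2.0, 8.0, 2, 1.0
K_H = (k_minus1 / k1) ** (1.0 / n)
print(fraction_active(i, K_H, n), steady_state_active(i, k1, k_minus1, n))
```

Both expressions return the same value, which confirms that the Hill function is simply the steady-state solution divided by the total protein concentration.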

As mentioned above, Tet promoter and Lac operator occupancy
is calculated via the Shea-Ackers method. Promoter occupancy
(also referred to as the transfer function TF) is calculated as a
ratio between the states (binding events) that allow transcription
and all the possible states in which the promoter can lie. Hence, in
the Shea-Ackers approach, the transfer function is the probability of
having transcription from the promoter under examination.

In the AND gate by Bordoy et al., the transfer function of the
Tet promoter is given by a fraction where the numerator contains
the only binding event that gives rise to transcription, i.e.,
$K_r \cdot \mathrm{RNAp}$, where $K_r$ is the association constant between RNA
polymerase and the promoter. The denominator of the transfer
function contains, in contrast, all the possible binding events, i.e.,
RNAp bound to the promoter (like in the numerator), TetR to a
single operator, and TetR to both operators. To quantify the TetR
binding events, the probability that TetR is free and active (as in
Eq. 24) shall be taken into account. If we use the notation $K_t$ to
indicate the association constant between TetR and tetOp, the
transfer function for the Tet promoter ($TF_{TP}$) becomes

$$TF_{TP} = \frac{K_r \mathrm{RNAp}}{1 + K_r \mathrm{RNAp} + 2 K_t \mathrm{TetR} f_{Te} + \left(K_t \mathrm{TetR} f_{Te}\right)^2} \qquad (26)$$

where $f_{Te}$ corresponds to $f_P^a$ in Eq. 24.

The other component of the AND gate is the lac operator.
Here, the transfer function $TF_{LO}$ is much easier to calculate since
no transcription can start from this location (i.e., the numerator is
equal to 1) and in the denominator we have only the case of LacI
bound to lacOp; therefore

$$TF_{LO} = \frac{1}{K_l \mathrm{LacI} f_{La}} \qquad (27)$$

where $K_l$ is the association constant between LacI and lacOp and
$f_{La}$ is a Hill function like in Eq. 24 in which the inducer is IPTG.
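
A minimal sketch of the Tet promoter transfer function is given below. It combines the Shea-Ackers ratio described above (one transcribing state over all binding states) with a Hill function for the fraction of free TetR. All parameter names and numerical values are hypothetical and chosen only to show the qualitative behavior.

```python
def hill_active(inducer: float, K_H: float, n: float) -> float:
    """Fraction of repressor not bound to its inducer."""
    return 1.0 / (1.0 + (inducer / K_H) ** n)

def tf_tet_promoter(rnap: float, tetr: float, tet: float,
                    K_r: float, K_t: float, K_H: float, n: float) -> float:
    """Probability of transcription from a promoter with two tet operators.

    States: empty promoter (1), RNAp bound (w), TetR on one of the two
    operators (2 * x), TetR on both operators (x ** 2).
    """
    w = K_r * rnap                               # RNAp binding weight
    x = K_t * tetr * hill_active(tet, K_H, n)    # free TetR binding weight
    return w / (1.0 + w + 2.0 * x + x * x)

# Illustrative values: without tetracycline TetR occupies the operators,
# with a large excess of tetracycline the promoter is almost free.
common = dict(rnap=1.0, tetr=1.0, K_r=1.0, K_t=1.0, K_H=1.0, n=2)
print(tf_tet_promoter(tet=0.0, **common))
print(tf_tet_promoter(tet=100.0, **common))
```

As the inducer concentration grows, the TetR terms vanish and the transfer function approaches the value set by RNAp binding alone.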
Bordoy et al. [24] used the transfer functions in Eqs. 26 and 27
to calculate the amount of green fluorescent protein at fixed
concentrations of the two chemicals. The easiest way was to sum them
up and multiply them by a parameter accounting for translation.
However, this simple formulation did not work, and further corrective
terms had to be added to have a model that could fit the
data. As shown in the original work by Shea and Ackers [25], it is
possible to use all the probabilities that the approach requires one to
calculate to write a system of ODEs, whose solution gives the
dynamics of the biological system under study.


2.5 RNA Polymerase II Collision and Composable Parts

The Boolean gates illustrated in Figs. 5 to 13 are different from
those implemented by Bordoy and Chatterjee
[23] since only one gene is expressed, under the sense promoter,
whereas the RNA polymerase II fired by the antisense promoter has
the task of opposing gene expression. Here, we sketch a model for gene
expression under RNA polymerase II collision on a convergent
promoter system based on the formalism of composable parts
[13]. RNA polymerase II plays a fundamental role in this modeling
framework since it is supposed to be stored in a pool connected to
every TU inside a circuit. Fluxes of RNA polymerase II are
exchanged between the RNApII pool and the circuit TUs. More
in general, genetic circuits are supposed to work thanks to the fluxes
of molecules referred to as common signal carriers [27]. RNApII
goes through any DNA part, synthesizes the pre-mature RNA, and
goes back to its pool. By dealing with synthetic DNA sequences, we
can neglect the presence of introns and the action of the spliceosome
on the pre-mature RNA [28]. A single reaction can lump all
the necessary steps for mRNA maturation and transport to the
cytoplasm where translation takes place. If the protein is a transcription
factor, it will then be imported back to the nucleus. A reporter
protein can be considered as stored in its own pool in the cytoplasm
(see Fig. 15).


2.5.1 RNApII Collision Without Transcription Regulation: A Simple Transcription Unit


As we have explained in the previous section, the main feature of
the “composable part” approach is to treat transcription as a result
of the RNApII flux through the DNA. This flux is not continuous
because RNApII “jumps” from one complex to another, which may or
may not lie in an adjacent part. We do not have to explicitly
introduce an antisense promoter in our model but only a new
species, the antisense RNApII (RNApIIa, fired by the antisense
promoter), and a complex where RNApIIa and RNApII (from
the sense promoter) meet and clash. This complex leads to RNApII
dropping off the DNA (see Fig. 16). As will become clear below,
RNApIIa is, basically, a fake species in this model (see Note 3).

RNApII interacts with the sense promoter P, the only one
explicitly present in the model, giving rise to the complex
[RNApII-P]

\[ \mathrm{RNApII} + P \longrightarrow [\mathrm{RNApII}\text{-}P] \qquad \text{(RNApII-promoter binding, Table 2)} \]

\[ [\mathrm{RNApII}\text{-}P] \longrightarrow \mathrm{RNApII} + P \qquad \text{(RNApII-promoter unbinding, Table 2)} \]

Fig. 15 Transcription and translation in eukaryotic cells according to the “Parts & Pools” (or composable part)
framework [30]. The production of a reporter protein demands a flux of RNApII from its pool to the promoter
upstream of the fluorescent protein and then along the whole DNA TU up to the terminator, where RNA
polymerase II is released and goes back to its pool (dashed arrows represent fluxes of molecules). Pools are clearly
an abstraction for the place where certain molecules are stored in the cell. The same idea of fluxes applies to
translation as well, where ribosomes flow through the mRNA and the reporter proteins, upon synthesis, move
to their pool in the cytoplasm


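Then RNApII has two possibilities: one is moving to the CDS,
making a complex with it, and leaving P free, such that other
molecules of RNApII can bind the DNA

\[ [\mathrm{RNApII}\text{-}P] \longrightarrow P + [\mathrm{RNApII}\text{-}\mathrm{CDS}] \qquad \text{([RNApII-CDS] complex formation, Table 2)} \]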


the other possible event is a clash with an RNApII molecule fired by
the antisense promoter (RNApIIa). First, the two RNApIIs are
supposed to form a complex and leave the promoter free (with
rate-constant k_c, where c stands for clash). Then RNApII leaves the
DNA (and goes back to its pool), whereas the antisense RNApIIa
produces a non-sense RNA that we call nil. Since RNApIIa never
appears as a reactant, this explains why, as stated above, it is not
a real species in this model

\[ [\mathrm{RNApII}\text{-}P] \xrightarrow{k_c} P + [\mathrm{RNApII}\text{-}\mathrm{RNApIIa}] \]

\[ [\mathrm{RNApII}\text{-}\mathrm{RNApIIa}] \longrightarrow \mathrm{RNApII} + \mathit{nil} \qquad \text{(RNApII-DNA drop off, Table 2)} \]

Fig. 16 Graphical representation of the reactions leading to pre-mature mRNA (pm) synthesis and RNApII
collision in the framework of composable parts
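RNApII bound to the CDS can reach the terminator

\[ [\mathrm{RNApII}\text{-}\mathrm{CDS}] \longrightarrow [\mathrm{RNApII}\text{-}T] \qquad \text{([RNApII-T] complex formation, Table 2)} \]

Here, transcription ends with RNApII leaving the DNA and
releasing the pre-mature mRNA, pm

\[ [\mathrm{RNApII}\text{-}T] \longrightarrow \mathrm{RNApII} + pm \qquad \text{(transcription, Table 2)} \]

The pre-mRNA undergoes maturation and transport to the
cytoplasm. The overall process is simplified in a single reaction

\[ pm \longrightarrow \mathrm{mRNA} \qquad \text{(maturation, Table 2)} \]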


In the cytoplasm, the ribosomes (r) bind the mRNA forming a
complex ([rm]) before starting the synthesis of the reporter protein
(F)

\[ r + \mathrm{mRNA} \longrightarrow [rm] \qquad \text{(ribosome-mRNA binding, Table 2)} \]

\[ [rm] \longrightarrow r + \mathrm{mRNA} \qquad \text{(ribosome-mRNA unbinding, Table 2)} \]

\[ [rm] \longrightarrow r + \mathrm{mRNA} + F \qquad \text{(translation, Table 2)} \]

To complete the model, we shall take into account all the
necessary degradation processes. We suppose that RNApII, the
ribosome, and the DNA (i.e., the promoter) do not decay. As
for the other species, we have

\[ pm \longrightarrow \varnothing \qquad \mathit{nil} \longrightarrow \varnothing \qquad \mathrm{mRNA} \longrightarrow \varnothing \qquad [rm] \longrightarrow r \qquad \text{(RNA decay, Table 2)} \]

\[ F \longrightarrow \varnothing \qquad \text{(reporter protein decay, Table 2)} \]

The meaning of the rate-constants in the reactions above,
together with their numerical values and those of other parameters
used in our simulations (e.g., species initial concentrations,
compartment volumes), is given in Table 2 (see Note 4). They are based
on our previous work [20].

Variations in the value of k_c mimic different transcription
strengths of the antisense promoter and their repercussions on
gene expression. By running a “Parameter Scan” with COPASI
[29], we can see that the number of reporter proteins decreases
quadratically with increasing values of k_c (from 0 to 1;
see Fig. 17). In particular, k_c = 1 s^-1 determines a reduction of
about 61% in fluorescence expression with respect to the absence
of RNApII collision. Therefore, according to this picture, an
antisense promoter stronger than the sense one (0.6 s^-1) does not
completely suppress the production of protein F.


2.5.2 Modeling a NOT Gate


By following this approach, let us see how to model the NOT gate
in Fig. 5, where the antisense promoter is repressed by an active
repressor R^a that interacts with an inducer i. In the absence of i, the
antisense promoter is OFF and, therefore, the reporter protein is
produced in high quantity. R^a gets completely inactivated when
exposed to an abundant concentration of i, such that the antisense
promoter is accessed by RNApII and collision between RNA
polymerase II molecules can take place on the DNA, with a considerable
reduction in fluorescence expression. In our modeling approach, k_c
can no longer be treated as a fixed number, as in the previous
section, but shall depend on the degree of repression of the
antisense promoter. We express k_c as a function of both the active
repressor R^a and the inactive, inducer-bound repressor R^i such
as

\[ k_c = k_s \, \frac{R^i}{1 + R^a} \]

where k_s (the antisense promoter “strength” of Table 3) is a
constant that varies with the antisense promoter (but
shall not be confused with a parameter that quantifies the promoter
leakage) and takes into account the promoter strength and the
affinity with R^a.


Table 2
Species, rate-constants, and other parameters used in the model of RNApII collision due to
convergent promoters

Meaning | Value | Unit
Nuclear volume | 2.9E-15 | L
Cytoplasmic volume | 3.91E-14 | L
Promoter (P) | 1 | Molecule
RNA polymerase II (RNApII) | 1000 | Molecule
Ribosome (r) | 3000 | Molecule
RNApII-promoter binding | 50,000 | M^-1 s^-1
RNApII-promoter unbinding | 0.1 | s^-1
[RNApII-CDS] complex formation | 0.5 | s^-1
Clash rate-constant (k_c) | Arbitrary | s^-1
RNApII-DNA drop off rate-constant | 0.5 | s^-1
[RNApII-T] complex formation | 0.5 | s^-1
Transcription rate-constant | 0.6 | s^-1
Maturation rate-constant | 5.5E-4 (30 min) | s^-1
Ribosome-mRNA binding | 35,000 | M^-1 s^-1
Ribosome-mRNA unbinding | 0.015 | s^-1
Translation rate-constant | 0.02 | s^-1
RNA decay rate-constant (pm, nil, mRNA, [rm]) | 5.7E-4 (20 min) | s^-1
Reporter protein decay rate-constant | 8.25E-05 (140 min) | s^-1

Fig. 17 Reporter protein number (reporter F, in molecules) as a function of k_c. An increase in the clash
rate-constant determines a quadratic decrease in the fluorescent protein number
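
The dependence of the reporter output on k_c can be cross-checked with a back-of-the-envelope estimate. The sketch below is not the COPASI model: it assumes that the single promoter is at quasi-steady state, that the RNApII pool is not depleted, and that everything downstream of the [RNApII-CDS] complex scales linearly with the flux leaving the promoter. Function and argument names are illustrative; the default values are those listed in Table 2.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def productive_flux(k_c, k_on=5.0e4, k_off=0.1, k_cds=0.5,
                    n_rnapii=1000, v_nucleus=2.9e-15):
    """Quasi-steady-state flux (1/s) of RNApII from the promoter to the CDS.

    k_on   : RNApII-promoter binding (1/(M s))
    k_off  : RNApII-promoter unbinding (1/s)
    k_cds  : [RNApII-CDS] complex formation (1/s)
    k_c    : clash rate-constant (1/s)
    """
    # Free RNApII concentration in the nucleus (pool assumed undepleted).
    rnapii_molar = n_rnapii / (AVOGADRO * v_nucleus)
    k_on_eff = k_on * rnapii_molar  # pseudo-first-order binding rate (1/s)
    # Fraction of time the single promoter is occupied by RNApII: the
    # complex is lost by unbinding, by moving to the CDS, or by a clash.
    occupancy = k_on_eff / (k_on_eff + k_off + k_cds + k_c)
    return k_cds * occupancy


def relative_output(k_c):
    """Productive flux at a given k_c relative to the collision-free case."""
    return productive_flux(k_c) / productive_flux(0.0)


if __name__ == "__main__":
    for k_c in (0.0, 0.25, 0.5, 1.0):
        print(f"k_c = {k_c:4.2f} 1/s -> relative output {relative_output(k_c):.3f}")
```

Under these assumptions, relative_output(1.0) is about 0.39, which is close to the reduction of roughly 61% quoted for k_c = 1 s^-1. The estimate only captures the competition at the promoter; the full time-course still requires the complete reaction network.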


With respect to the model for the unregulated transcription
unit, we shall consider also the production of R^a, its interaction
with the inducer i, and the degradation of both R^a and R^i. The
inducer concentration can be treated as a constant. Moreover, for
the sake of simplicity, we do not model a circuit made of two
transcription units, one encoding for R^a and the other for F, but
describe R^a synthesis as a zeroth-order reaction

\[ \varnothing \longrightarrow R^a \qquad \text{(R}^a\text{ production, Table 3)} \]

whereas the interactions between repressors and inducers are
given by

\[ i + R^a \longrightarrow R^i \qquad \text{(repressor-inducer binding, Table 3)} \]

\[ R^i \longrightarrow i + R^a \qquad \text{(repressor-inducer dissociation, Table 3)} \]

Inducer concentration is fixed at the beginning of a simulation,
and the decay of the chemical is neglected. In contrast, the
repressors are degraded

\[ R^a \longrightarrow \varnothing \qquad \text{(active repressor degradation, Table 3)} \]

\[ R^i \longrightarrow i \qquad \text{(inactive repressor degradation, Table 3)} \]

The rate-constants used in the above model are explained in
Table 3, which also gives the numerical values used in our
simulations.

Table 3
Parameters, and corresponding values, used to simulate a NOT gate based on RNApII collision

Meaning | Value | Unit
R^a production rate | 3.6E-10 | M/s
R^a-inducer binding | 1E+09 | M^-1 s^-1
R^a-inducer dissociation | 1 | s^-1
Antisense promoter “strength” | 0.3 |
Active repressor degradation | 2.7E-4 (40 min) | s^-1
Inactive repressor degradation | 2.7E-4 (40 min) | s^-1

Fig. 18 NOT gate based on convergent promoter architecture. The number of reporter proteins (molecules) is
shown in logarithmic scale against the inducer concentration (M; axis ticks from 1.00E-09 to 1.00E-03). Bar
labels: 2983.21, 2982.15, 2972.52, 2872.39, 1237.75, 2.75, and 2.69


As shown in Fig. 18, the NOT behavior of the circuit is apparent.
The logic “0” output is reached at an inducer concentration of
10 μM (input = 1), at which fewer than 4 molecules of F are
present in the system, compared to the almost 3000 (output = 1) in
the absence of i (input = 0).
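
The repressor reactions of this section translate, under mass-action kinetics and a constant inducer concentration, into two coupled ODEs. The block below is a sketch of that translation; the symbols k_p (production), k_b (binding), k_u (dissociation), and k_{d,a}, k_{d,i} (degradation of the active and inactive repressor) are placeholder names introduced here for the corresponding entries of Table 3, not the chapter's own notation.

```latex
\begin{align}
\frac{dR^a}{dt} &= k_p - k_b\, i\, R^a + k_u\, R^i - k_{d,a}\, R^a \\
\frac{dR^i}{dt} &= k_b\, i\, R^a - k_u\, R^i - k_{d,i}\, R^i \\
\left.\frac{R^i}{R^a}\right|_{\text{steady state}} &= \frac{k_b\, i}{k_u + k_{d,i}}
\end{align}
```

With the values of Table 3 and an inducer concentration of 1E-05 M, k_b i equals 1E+04 s^-1 whereas k_u + k_{d,i} is close to 1 s^-1, so at steady state there are roughly 10^4 inactive repressors per active one. This is the regime in which the antisense promoter is fully derepressed.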


2.5.3 Modeling a Two-Input NOR Gate

The NOT gate we have just analyzed can be easily turned into the
NOR gate in Fig. 8 by substituting the constitutive sense promoter
with an activated one. In its ground state, the sense promoter P is
basically OFF (any leakage is here neglected) and is turned into a
functional, active configuration, P^a, upon binding the active
activator A^a. A^a, however, is inactivated when the corepressor c is
present in the system at a proper concentration.
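
At the Boolean level, the mechanism just described can be summarized in a few lines. The sketch below is a purely logical abstraction with hypothetical function names; it ignores kinetics, leakage, and concentrations, and only checks that the two regulated promoters combine into a NOR function of inducer and corepressor.

```python
from itertools import product


def sense_promoter_on(corepressor: bool) -> bool:
    """The sense promoter fires only while the activator is active (no corepressor)."""
    return not corepressor


def antisense_promoter_on(inducer: bool) -> bool:
    """The antisense promoter fires only when the inducer inactivates its repressor."""
    return inducer


def reporter_on(inducer: bool, corepressor: bool) -> bool:
    """Reporter is produced if the sense promoter fires and no collision opposes it."""
    return sense_promoter_on(corepressor) and not antisense_promoter_on(inducer)


def truth_table():
    """Rows (inducer, corepressor, output) in binary counting order."""
    return [(i, c, reporter_on(i, c)) for i, c in product((False, True), repeat=2)]


if __name__ == "__main__":
    for inducer, corepressor, out in truth_table():
        print(int(inducer), int(corepressor), "->", int(out))
```

Only the input state without inducer and without corepressor yields a logic 1. In the simulated device, the quality of this separation is summarized by the 1-to-0 ratio reported with Fig. 19 (2781.10 / 3.17, about 877-fold).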


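With respect to the model for the NOT gate, nothing changes
in the description of the flux of the antisense RNApII. In contrast,
the “current” of sense RNApII is modulated by the corepressor
molecules. Overall, a model for this two-input gate requires only a
few more reactions, based on mass-action kinetics, i.e., the active
activator synthesis

\[ \varnothing \longrightarrow A^a \qquad \text{(A}^a\text{ production, Table 4)} \]

the binding and unbinding of the activator to the DNA

\[ A^a + P \longrightarrow P^a \qquad \text{(activator-promoter binding, Table 4)} \]

\[ P^a \longrightarrow A^a + P \qquad \text{(activator-promoter dissociation, Table 4)} \]

the interaction between the activator and the corepressor (both
when the activator is bound to the DNA and when it is free in the
nucleus)

\[ c + A^a \longrightarrow A^i \qquad \text{(activator-corepressor association, Table 4)} \]

\[ A^i \longrightarrow c + A^a \qquad \text{(activator-corepressor dissociation, Table 4)} \]

\[ c + P^a \longrightarrow P + A^i \qquad \text{(corepressor binding on the DNA, Table 4)} \]

and the degradation reactions

\[ A^a \longrightarrow \varnothing \qquad \text{(active activator degradation, Table 4)} \]

\[ A^i \longrightarrow c \qquad \text{(inactive activator degradation, Table 4)} \]

\[ P^a \longrightarrow P \qquad \text{(active activator degradation on the DNA, Table 4)} \]

An overview of the parameters we used in this model is given in
Table 4.

Figure 19 shows how the NOR gate is faithfully reproduced by
our convergent promoter design. As shown in the truth table, the
“1” logic value corresponds to a concentration of 10 μM, as in the
NOT gate.


3 Conclusions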


In this chapter, we have sketched how gene Boolean gates can be
designed by means of a convergent promoter architecture, where
we considered RNApII collision as the main mechanism responsible
for logic behavior. Potentially, complex logic circuits can be
drawn in an automatic way by adapting the rules for gate design and
composition, illustrated above, to the framework we developed in
[2], where the only input the user has to supply is the circuit truth
table. Moreover, we have also shown how our approach of modeling
genetic circuits by means of DNA composable parts and pools
of molecules can be adapted to the simulation of RNApII collision
by, basically, considering a single parameter, which we termed k_c, to
describe the action of the antisense promoters under regulation of
repressors or activators. A proper estimation of k_c for
promoter-transcription factor pairs might lead to predictive,
uncomplicated models for even intricate synthetic gene networks.

Table 4
Parameters, and corresponding values, added to the NOT gate model in order to simulate a NOR gate
based on convergent promoters

Meaning | Value | Unit
A^a production rate | 3.6E-10 | M/s
A^a-promoter binding | 1E+06 | M^-1 s^-1
A^a-promoter dissociation | 0.1 | s^-1
A^a-corepressor association | 1E+09 | M^-1 s^-1
A^a-corepressor dissociation | 1 | s^-1
A^a-corepressor binding on the DNA | 1E+06 | M^-1 s^-1
Active activator degradation | 2.7E-4 (40 min) | s^-1
Inactive activator degradation | 2.7E-4 (40 min) | s^-1
Active activator degradation on the DNA | 2.7E-4 (40 min) | s^-1

Fig. 19 Performance of a NOR gate based on convergent promoters. The y axis, reporting the number of
fluorescent proteins in the cell, is in logarithmic scale; the x axis reports the inducer/corepressor
concentrations (M). Bar labels: 2781.10, 3.17, 0.05, and 5.98E-05. The 1-to-0 ratio of the device is about
877-fold


1. The piece of software corresponding to this method has been 


2. EC corresponds to what we called Pof! in our method for the 
modular modeling of genetic circuits with composable 


3. The pre-mRNA produced by RNApIIa (here called nil) is
not necessary and can be omitted from the model.


4. The initial amount of all the species that are not present in


Computational Methods for the Design of Recombinase
Logic Circuits with Adaptable Circuit Specifications


Ana Zuniga, Jerome Bonnet, and Sarah Guiziou 


Abstract 


Synthetic biology aims at engineering new biological systems and functions that can be used to provide new 
technological solutions to worldwide challenges. Detection and processing of multiple signals are crucial for 
many synthetic biology applications. A variety of logic circuits operating in living cells have been imple- 
mented. One particular class of logic circuits uses site-specific recombinases mediating specific DNA 
inversion or excision. Recombinase logic offers many interesting features, including single-layer architec- 
tures, memory, low metabolic footprint, and portability in many species. Here, we present two automated 
design strategies for both Boolean and history-dependent recombinase-based logic circuits. One approach 
is based on the distribution of computation within multicellular consortia, and the other is a single-cell 
design. Both are complementary and adapted for non-expert users via a web design interface, called CALIN 
and RECOMBINATOR, for multicellular and single-cell design strategies, respectively. In this book 
chapter, we are guiding the reader step by step through recombinase logic circuit design, from selecting 
the design strategy fitting to their final system of interest to obtaining the final design using one of our 
design web interfaces. 


Key words Recombinase, Logic, Synthetic biology, Web interface, Automatized design, History- 
dependent, Boolean, Single cell, Multicellular consortia 


Glossary 
— Compact: A design in which the number of parts needed to 
perform a function is reduced to its minimum. 


— Automatic design: Theoretical design performed via software, 
sometimes through a web interface. 


— Portable: Implementable in various organisms. 


— Scalable: The design principles developed at a given scale (e.g., a 
certain number of inputs) are applicable to a larger scale (here 
for an increasing number of inputs). 


— Complete: Capable of implementing all logic functions. 


— Reusable: The parts developed can be used for the construction 
of other circuits.


Introduction


Synthetic biology consists in the engineering of new biological 
systems with the aim of (i) further understanding biology and 
(ii) providing new technological solutions to worldwide challenges 
such as climate change and healthcare. Examples of such engi- 
neered biological systems include building synthetic metabolic 
pathways in yeast to produce drugs in a more affordable manner 
[1, 2], developing synthetic live bacterial therapeutics [3-6], 
and engineering functional living biomaterials providing new solu- 
tions to healthcare challenges [7-9]. 

In nature, cells adapt to their environment by sensing and 
processing myriad signals and performing actions accordingly. Sim- 
ilarly, synthetic biological systems rely on the detection and inte- 
gration of multiple endogenous or exogenous signals for 
multiplexed biosensing [10], bioproduction of complex chemical 
compounds [11, 12], or production of biopolymers that can 
respond to change in their environment [8]. 

Synthetic biologists have mimicked electronic circuits to implement
cellular devices built from biological molecules that can process
multiple signals (Fig. 1a). In this context, the main approach
treats molecular or physical signals as binary inputs (which can have
two different states, as in electronics), and cellular processing
devices are likened to logic circuits. While this chapter is focused
on digital logic circuits, analog logic circuits have also been
implemented in living organisms [13].

To implement logic circuits, numerous molecular mechanisms 
have been used, such as transcription regulators [14-17], RNA 
molecules [18, 19], and site-specific recombinases [20-23]. Here, 
we focus on implementing logic circuits using recombinases, spe- 
cifically the family of serine integrases [24]. Serine integrases are a 
tool of choice for large logic circuit implementation. Numerous 
orthogonal serine integrases have been characterized [25] and have 
already been implemented in numerous organisms such as bacteria, 
plants, and mice [26]. Serine integrases recognize two DNA sites 
and recombine DNA between these two sites depending on their 
relative orientations, leading to an inversion of the DNA if the sites 
are in opposite orientation and to an excision if the sites are in the 
same orientation (Fig. 1b). In recombinase logic circuits, each 
input induces the expression of a recombinase, while circuit output 
is the expression of a reporter or production of a compound of 
interest. To implement logic functions using recombinases, promo- 
ters, terminators, and output genes are combined in a specific 
manner with integrase sites to condition the expression of the 
output gene to a particular combination of inputs (Fig. 1c)
[20]. Recombinase logic circuits of up to six inputs have been
implemented [22], and various design strategies have been used
[20, 22, 23, 25, 27]. Integrase-mediated recombination is
irreversible in the absence of cofactor, and recombinase logic
devices exhibit permanent memory. Consequently, inputs are
considered ON if they have been present at any time in the circuit
history. Recombinase logic devices thus implement what is called
“asynchronous” logic because the inputs can be applied
asynchronously.

Fig. 1 Recombinase-based logic circuits. (a) Biological logic system. Multiple signals (either environmental or
endogenous) are detected by a cell. Each analog signal is converted into a binary signal. In this example,
signals B and C are considered 1, being above a defined threshold, and signal A is considered 0. Then, a logic
circuit (here implementing the logic function (A OR B) AND NOT C) processes these signals and produces a
specific output. Biological logic systems are used to engineer biomaterials and biosensors and to control protein and
metabolite production. (b) Recombinase switch. Expression of a serine integrase is controlled by the input
signal. The integrase recognizes two integrase sites: attB and attP. If the sites are in opposite orientation (left
side), the DNA between the sites (here the promoter) is inverted, leading to two new sites: attL and attR. The
integrase alone cannot mediate recombination between attL and attR sites. If the sites are in the same
orientation (right side), the DNA between the sites is excised, leading to a single integrase site, either attL or
attR. (c) Example of a recombinase AND gate [20]. The AND logic device is composed of one promoter,
two asymmetric terminators surrounded by integrase sites in inversion orientation, and a gene. In the absence
of input, the output gene cannot be expressed as the RNA polymerase is blocked by the two terminators. In the
presence of input 1 or input 2, integrase 1 (turquoise) or integrase 2 (orange) is expressed, and the terminator
surrounded by the corresponding sites is inverted. The output gene is still not expressed as one asymmetric
terminator is still blocking transcription. Both inputs need to have been present to have both terminators
inverted and then expression of the output gene, implementing an AND gate. (d) Example of the two-input
history-dependent scaffold. Integrase sites are positioned to permit the expression of output genes in the
corresponding lineage. For each state of the lineage, a different gene is expressed. Gene 0 is
only expressed when no input is present. If input 2 is present first, gene 1 is expressed. If input 1 is present
first, no gene is expressed (nor will any be expressed) as the promoter is excised. If input 1 follows input 2, gene
2 is expressed
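
The recombinase switch of Fig. 1b and the AND gate of Fig. 1c can be mimicked with a small toy model. The sketch below is an illustration, not part of CALIN or RECOMBINATOR: the Part class, the function names, and the integrase labels int1 and int2 are hypothetical, terminators are assumed to block transcription only in the forward orientation (the "asymmetric" terminators of the caption), and a recombined site is simply flagged as no longer usable to reflect irreversibility.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Part:
    kind: str              # "promoter", "terminator", "gene" or "site"
    forward: bool = True   # orientation on the DNA
    integrase: str = ""    # for sites: name of the integrase recognising them
    active: bool = True    # True for attB/attP, False once turned into attL/attR


def _spent(site):
    """Mark a site as recombined (attL/attR); the integrase alone cannot reuse it."""
    return replace(site, active=False)


def recombine(dna, integrase):
    """Return the DNA after one integrase has acted (inversion or excision)."""
    idx = [k for k, p in enumerate(dna)
           if p.kind == "site" and p.integrase == integrase and p.active]
    if len(idx) != 2:
        return list(dna)  # nothing left to recombine: the device keeps its state
    i, j = idx
    if dna[i].forward == dna[j].forward:
        # Same orientation: excision, a single recombined site is left behind.
        return dna[:i] + [_spent(dna[i])] + dna[j + 1:]
    # Opposite orientation: the segment is reversed and every part in it flips.
    inner = [replace(p, forward=not p.forward) for p in reversed(dna[i + 1:j])]
    return dna[:i] + [_spent(dna[i])] + inner + [_spent(dna[j])] + dna[j + 1:]


def expressed(dna):
    """True if a forward promoter reaches a forward gene with no forward terminator between."""
    reading = False
    for p in dna:
        if p.kind == "promoter" and p.forward:
            reading = True
        elif p.kind == "terminator" and p.forward:
            reading = False
        elif p.kind == "gene" and p.forward and reading:
            return True
    return False


# AND gate: promoter, two terminators each flanked by sites in inversion orientation, gene.
AND_GATE = [
    Part("promoter"),
    Part("site", True, "int1"), Part("terminator"), Part("site", False, "int1"),
    Part("site", True, "int2"), Part("terminator"), Part("site", False, "int2"),
    Part("gene"),
]


def run(dna, inputs):
    """Apply the integrases in the order in which their inputs occur."""
    for name in inputs:
        dna = recombine(dna, name)
    return expressed(dna)


if __name__ == "__main__":
    for history in ([], ["int1"], ["int2"], ["int1", "int2"], ["int2", "int1"]):
        print(history, "->", int(run(AND_GATE, history)))
```

Because the sites of the two integrases are not interleaved, the output of this toy device depends only on which inputs have occurred, not on their order, and applying the same input twice changes nothing.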

If integrase sites are not interleaved, the output of the system
will be the same regardless of the order of occurrence of the
inputs, implementing Boolean asynchronous logic. This type of
logic circuit is of interest for biosensing in which delayed readout
can be necessary [28].

If the integrase sites are interleaved, recombination reactions 
can influence each other, and the output of the system can be 
different depending on the order of occurrence of the inputs; in 
this case the logic implemented is history-dependent. These types 
of programs are ubiquitous in biology, being involved in funda- 
mental processes like cell division (checkpoints), differentiation 
(cell-fate commitment), and development as well as microbial sur- 
vival strategies by providing fitness advantage in the evolutionary 
competition [29-31]. History-dependent programs can be used as 
temporal and spatial trackers for decoding and encoding
processes such as development [32].

We present in this chapter design frameworks for the imple- 
mentation of asynchronous Boolean and history-dependent logics. 

The design of recombinase logic circuits is challenging as it 
does not follow electronic logic standards. Interestingly, complex 
logic functions can be implemented within a single layer; for exam- 
ple, an XOR logic gate can be built using a terminator surrounded 
by two pairs of integrase sites in inversion orientation [20]. While 
circuits can be designed by hand for a small number of inputs, the 
task becomes daunting as the number of inputs and possible part 
combinations increases. Thus, accessible software tools for
designing recombinase-based logic circuits are needed. Similar efforts
have been made for repressor-based logic circuits (CELLO)
[14]. We developed two computational methods for designing
recombinase logic circuits. Each of them provides a different
approach to systematize recombinase circuit design. The first
design strategy, called CALIN (Composable Asynchronous logic
Integrase Networks), allows the implementation of logic circuits
by distributing the computational labor across a multicellular
consortium, using a limited number of standardized logic devices
that can be mixed and matched [23, 27]. CALIN enables the
implementation of Boolean and history-dependent logic; both are
scalable to five inputs.

The second design strategy, called RECOMBINATOR, uses a 
database of devices generated in a combinatorial manner within 
which architectures implementing a particular function can be 
found. The RECOMBINATOR strategy aims at implementing 
logic within a compact and single-layer device operating in single 
cells, using an ad hoc design for each case [33]. The RECOMBI- 
NATOR database is limited to devices implementing Boolean logic; 
however, the same strategy is applicable in a straightforward man- 
ner to history-dependent logic. 

The two different strategies with their automatized computational
design methods are complementary; they have different
properties which can be advantageous depending on the context
of implementation. Therefore, choosing between one and the other
will depend on the particular specifications determined by the user
(see Table 1 for a few examples).

Table 1
Type of design required according to the application type

Fields | Applications | Time of usage | Physical confinement | # strains | Type of design
Biosensing | In vitro medical diagnostic; environmental diagnostic | Short use, one shot | Confined | Unlimited | Multicell
 | On-site environmental diagnostic | Long term | Free | Medium | Single or multicell depending on input number
Therapy | Therapeutic bacteria | | | Low, better = 1 | Single cell
 | Environmental bioremediation | | | Reduce to 1-3 strains | Single cell
Metabolic engineering | Production by fermentation | Medium term | Medium confinement | Medium | Single or multicell depending on input number

The objective of this book chapter is to provide guidelines on 
how to design recombinase-based logic circuits using multicellular 
or single-cell designs, following the CALIN or RECOMBINATOR 
strategies, respectively (Fig. 2). First, we describe how to choose 
which design strategy to use according to the device specification, 
and then we explain how to define and write down the logic 
function to implement. Finally, we show how to use the two web 
interfaces to obtain the final logic design. 

2 Methods

2.1 Circuit Specification

Depending on the application, the user, and the complexity of the
logic function, one design should be preferred over the other. The
multicellular approach allows for a systematic and modular design
by applying distributed multicellular computation using a reduced
number of already characterized and composable biological
components (Fig. 2). However, this approach can lead to cellular
consortia composed of a high number of different cells, with issues of
stability. Additionally, to maintain a consortium composed of
different strains, a confined environment is required. The single-cell
approach enables a compact design and avoids competition
problems between strains but leads to more ad hoc designs which
require more expertise and heavier engineering work to obtain
devices operating as expected (Fig. 2). Table 1 lists some logic
circuit applications with their specifications and the favorable type
of logic circuit design to use.

For users without much synthetic biology expertise, a multicellular
design is favorable as the optimization process is straightforward
and existing engineered devices can be used [21, 23].
Similarly, for users requiring the implementation of numerous
logic circuits, the composable multicellular design is more
advantageous as the majority of functions can be implemented by mixing
and matching the existing logic devices. Of note, depending on the
logic function of interest, the multicellular design computational
method can lead to a single-cell system. The two computational
methods can therefore be run in parallel and the resulting logic
circuits compared to choose the final design.

While the single-cell design strategy that we present here could
be extended to history-dependent logic, it for now allows the
design of Boolean logic functions only.

Fig. 2 Comparison of the two logic circuit design strategies (multicell: systematic and modular design;
single cell: ad hoc design; axes: modularity and compactness). In recombinase logic
circuit design, modularity is usually inversely proportional to compactness. The
multicellular design strategy leads to composable, highly modular circuits, but
these circuits have low compactness as they require the assembly of a
multicellular consortium. The single-cell design strategy leads to compact designs
that can be implemented in a single cell but have low modularity, as each device
can be used only in a very restricted set of situations, and are more challenging to
engineer because they do not follow general design rules


2.2 Logic Function

To use the CALIN or RECOMBINATOR interfaces (for multicellular or
single-cell designs, respectively), users first need to determine the
logic program to implement.


2.2.1 Boolean Logic


For Boolean logic, logic programs can be written as a Boolean
equation (e.g., f(A,B) = A.B) encoding the output state. Since the
establishment of logic by Aristotle and of Boolean algebra by George
Boole, various terms and notations have been used to discuss
logical reasoning and write down Boolean equations (Table 2). For
example, to express a gene only if signal A is present and signal B
is not, the Boolean function f(A,B) = A AND NOT(B), also written as
A.!B, has to be implemented.
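
The correspondence between a Boolean equation and its truth table can be made concrete with a short script. The sketch below is an illustration only and is not the parser used by either web interface: it assumes single-character input names, the operators "." (AND), "+" (OR), and "!" (NOT) with the usual precedence, and optional parentheses; the function names are hypothetical.

```python
from itertools import product


def evaluate(expr, values):
    """Evaluate a Boolean expression such as 'A.!B' for given input values."""
    tokens = [ch for ch in expr if not ch.isspace()]
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "+":
            pos += 1
            rhs = parse_and()          # always parse the right-hand side
            result = result or rhs
        return result

    def parse_and():
        nonlocal pos
        result = parse_not()
        while pos < len(tokens) and tokens[pos] == ".":
            pos += 1
            rhs = parse_not()
            result = result and rhs
        return result

    def parse_not():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "!":
            return not parse_not()
        if token == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        return bool(values[token])     # a single-character input name

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return result


def truth_table(expr, names):
    """List of (input bits, output) rows in binary counting order."""
    rows = []
    for bits in product((0, 1), repeat=len(names)):
        rows.append((bits, int(evaluate(expr, dict(zip(names, bits))))))
    return rows


if __name__ == "__main__":
    for bits, out in truth_table("A.!B", "AB"):
        print(bits, "->", out)
```

For f(A,B) = A.!B, the only input state with output 1 is A = 1, B = 0, which is the gene-expression condition described in the text.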

While there might be some debates on which notation is “correct,”
the use of a notation usually reflects the habits and usages of
particular scientific communities (e.g., mathematics, informatics,
etc.). Therefore, for a given application, the only important
guideline is to choose one notation, be consistent, and not mix the
different notations together.

In the RECOMBINATOR design interface, the Boolean function
must be written using + for OR, . for AND, and ! for negation.

In the CALIN web interface, the logic function has to be
written down as a truth table specifying the output state (either
0 for OFF or 1 for ON) in each input state.

Table 2
Conversion from truth function to Boolean function

Classic language | Mathematical language | Truth function | Boolean function
True | True | | 1
False | False | | 0
OR | Disjunction | ∨ | +
AND | Conjunction | ∧ | .
NOT | Negation | ¬ | overbar or !


2.2.2 History-Dependent Logic


For asynchronous history-dependent logic, the notation is not as
rigorously established. Here, we have to consider the relative time
of occurrence of inputs. State machine diagrams have been used for
this purpose. In our system, we have memory and cannot reset the
system; therefore, the number of possible events is different from,
and lower than, that in most typical state machine diagrams. We
represent the history-dependent programs as a lineage tree in which
each branch, or lineage, corresponds to a specific order of
occurrence of the inputs (for two inputs: A and then B; B and then A).
The number of lineages is equal to N!, where N is the number of
inputs, for instance, two lineages for two-input programs and six
for three-input programs. Each node of the tree corresponds to an
input state, and we represent the output on each node by a number
from 0 to 9 (0 corresponds to no output and 1 to 9 to different
outputs).
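
The lineage tree can be enumerated directly. The sketch below uses hypothetical function names and treats an input state as the ordered sequence of inputs that have occurred so far, which is an interpretation of the description above rather than the data model of the CALIN interface.

```python
from itertools import permutations
from math import factorial


def lineages(inputs):
    """All complete orders of occurrence of the inputs (one per branch of the tree)."""
    return list(permutations(inputs))


def input_states(inputs):
    """All nodes of the lineage tree: ordered sequences of 0 to N distinct inputs."""
    nodes = []
    for length in range(len(inputs) + 1):
        nodes.extend(permutations(inputs, length))
    return nodes


if __name__ == "__main__":
    for n, names in ((2, "AB"), (3, "ABC")):
        print(n, "inputs:", len(lineages(names)), "lineages (N! =", factorial(n), "),",
              len(input_states(names)), "input states")
```

A history-dependent program then amounts to assigning an output number (0 to 9) to each of these states, for example with a dictionary keyed by the tuples returned by input_states.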

Of note, in recombinase-based devices, inputs are decoupled
from the logic implementation. Indeed, the identity of an input is
defined by the conditional expression of an integrase, e.g., by the
connection of an inducible promoter responding to a signal (input)
of interest to an integrase. Therefore, by using a single logic device
and switching the connection between inputs and integrases,
various logic functions can be implemented in a very straightforward
manner and without further optimization. Logic functions
implementable using the same logic device are equivalent when inputs are
permuted and belong to the same P-class (where P stands for
permutation) (Fig. 3). For example, the function A.not(B) is
P-equivalent to the function not(A).B. We widely used this
property in the CALIN and RECOMBINATOR web interfaces to
reduce the number of logic devices required or generated. While
exemplified here with a Boolean logic function, this property also
holds for history-dependent functions.

Fig. 3 Implementation of two logic functions belonging to the same permutation
class (P-class) using one logic device and permuting the connections between
integrases and inputs (Input A and Input B)


2.3 The CALIN Web Interface for Multicellular Design

The CALIN web interface allows for the systematic design of logic
circuits operating in a single layer as a multicellular system,
requiring neither cell-cell communication nor spatial separation.


2.3.1 Asynchronous Boolean Logic


The algorithm starts by decomposing each logic function as a sum 
of products of NOT or IMPLY functions, called sub-functions. An 
IMPLY function corresponds to f(X) = X and a NOT function to f 
(X) = NOT(X) (Fig. 4a). Each sub-function is implemented in a 
single cell using a combination of IMPLY elements in series and 
NOT elements in parallel. IMPLY elements are composed of a 
terminator surrounded by integrase sites and NOT elements by 
promoters surrounded by integrase sites (Fig. 4b). 
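
The decomposition described above can be illustrated with a minimal Python sketch. It is not the CALIN script: the function names (`decompose`, `to_formula`, `element_counts`, `evaluate`) are hypothetical, and, as a simplifying assumption, each ON row of the truth table becomes one sub-function without any minimization. The example uses the function of Fig. 4a, f = NOT(A).NOT(B).C + A.B.C.

```python
from itertools import product

def decompose(on_states, names):
    """One sub-function per ON input state (canonical sum of products, not minimized)."""
    return [
        [(name, "IMPLY" if bit else "NOT") for name, bit in zip(names, state)]
        for state in on_states
    ]

def to_formula(sub):
    """Render a sub-function in the chapter's notation, e.g. NOT(A).NOT(B).C."""
    return ".".join(n if kind == "IMPLY" else f"NOT({n})" for n, kind in sub)

def element_counts(sub):
    """Count the IMPLY (series) and NOT (parallel) elements of one sub-function."""
    n_not = sum(kind == "NOT" for _, kind in sub)
    return {"IMPLY": len(sub) - n_not, "NOT": n_not}

def evaluate(subs, state, names):
    """The multicellular output is ON if any single sub-function (cell) is ON."""
    values = dict(zip(names, state))
    return any(
        all(bool(values[n]) == (kind == "IMPLY") for n, kind in sub)
        for sub in subs
    )

names = ["A", "B", "C"]
on_states = [(0, 0, 1), (1, 1, 1)]  # truth-table rows whose output is 1
subs = decompose(on_states, names)
for sub in subs:
    print(to_formula(sub), element_counts(sub))
```

Each printed line corresponds to one cell of the multicellular system; the element counts indicate how many terminator-based (IMPLY) and promoter-based (NOT) elements its logic device would combine.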

After the user enters the number of inputs and the truth table
corresponding to the logic function of interest, the web interface
generates the biological logic design: the number of strains required
and the genetic circuit layout for each strain, i.e., the connection of
integrase genes with the inducible promoters corresponding to each
input, plus the logic device (Fig. 4c). For each logic device, a DNA
sequence corresponding to an optimized design for E. coli is also
available.

Fig. 4 CALIN automatized design strategy. (a) Workflow of logic circuit design. The input of the CALIN web
interface is a truth table corresponding to the logic function of interest. The function is decomposed as a sum
of products of IMPLY (such as f(X) = X) and NOT (such as f(X) = NOT(X)) functions, here: f = f1 + f2 with
f1 = NOT(A).NOT(B).C and f2 = A.B.C. Each sub-function is implemented in a single cell, and the composition
of the f1 and f2 cells allows the implementation of the full logic function in a multicellular logic system. (b)
Implementation of IMPLY and NOT functions using recombinase-based excision elements. IMPLY functions are
implemented by surrounding with integrase sites in excision orientation a terminator placed between a promoter
and the output gene. In the absence of input, the terminator blocks the expression of the output gene. In
the presence of the input, the integrase is expressed, and the terminator is excised, leading to the expression
of the output gene. The IMPLY logic element therefore switches from state 0 to state 1. NOT functions are
implemented by surrounding a promoter with integrase sites in excision orientation. The output gene is
expressed in the absence of the input; in the presence of the input, the integrase mediates the excision of the
promoter, and the output gene is no longer expressed. The NOT logic element switches from state 1 to state 0 in
the presence of the input. (c) Output of the CALIN web interface: the logic device and integrase/inducible
promoter cassette for each cell. The design of the logic devices computing the logic sub-functions is based on
the composition of IMPLY and NOT logic elements. IMPLY logic elements are placed in series, while NOT logic
elements are placed in parallel. The sub-function f1 (NOT(A).NOT(B).C) is composed of two NOT elements in
parallel corresponding to the NOT(A).NOT(B) function (nested integrase sites in excision orientation surrounding
the promoter) and an IMPLY element placed between the promoter and the gene corresponding to the C
function. The sub-function f2 is composed of three IMPLY elements in series


The web interface is based on a Python script which converts a
Boolean logic function into a genetic logic design in an automated
manner. Here, we detail the algorithm of this Python script.


1. The first step is to decompose the input Boolean function as 
independent sub-functions. To do so, we write the logic func- 
tion in its disjunctive normal form corresponding to a sum of 
products of input variables or their negations using the 
McCluskey algorithm [34]: 


$$F(x_1, \ldots, x_N) = \bigvee_{j=1}^{M} \left( \bigwedge_{i=1}^{N} \phi_{i,j}(x_i) \right)$$


N corresponds to the number of inputs and M to the
number of terms in the disjunction. $\phi_{i,j}$ is either the IMPLY
or the NOT function, such that $\phi_{i,j}(x_i)$ is equal to $x_i$ or not($x_i$).
The McCluskey algorithm takes as input the ON and OFF
output states, i.e., the input states with a one or a zero as output,
respectively. The algorithm provides an array of strings as an
output, each corresponding to a sub-function.


2. Each sub-function is translated into the corresponding logic
and integrase device using our Python algorithm. In this
design, the number of logic devices is minimized. Indeed,
functions belonging to the same P-class are implemented
with the same logic devices, and only the connection between
inputs and integrases is permuted.

The logic device encoding each sub-function is obtained by
extracting the number of IMPLY and NOT functions of the
sub-function and by following the design rules detailed briefly
above and described in detail in [27]. The integrase device is
obtained by associating each integrase with the input that permits
the implementation of the desired logic function.


3. The DNA sequence of the logic devices for E. coli is generated.
The generated DNA sequence results from a hierarchical composition
of optimized logic elements [21]. Various permutations of integrase
sites have been characterized for each logic element corresponding
to IMPLY and NOT functions with different integrases. Well-behaving
IMPLY and NOT functions were selected and composed to obtain the
16 well-behaving logic devices permitting the implementation of all
4-input logic functions [21]. The same design strategy can be used to
optimize logic devices for other organisms.


2.3.2 History-Dependent Logic

In this case, the algorithm takes as input a lineage tree, which is
equivalent to a sequential truth table. The output is the biological
implementation: for each strain, a graphical representation of the
genetic circuit and its associated DNA sequence (Fig. 5). In the tree,
each node corresponds to a specific state of the system in response to
a different scenario: no input has occurred, one input has occurred, or
multiple inputs have occurred in a particular sequence.
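
To make the lineage tree concrete, the sketch below enumerates its nodes as ordered sequences of distinct inputs and groups ON states into single-lineage subprograms, taking the deepest ON state first. It is an illustration only, not the CALIN code: the function names are hypothetical, a single output is assumed, and the greedy grouping is a simplified stand-in for the decomposition used by the interface.

```python
from itertools import permutations
from math import factorial

def lineage_tree_nodes(inputs):
    """Every node of the lineage tree: an ordered sequence of distinct inputs."""
    return [seq for k in range(len(inputs) + 1) for seq in permutations(inputs, k)]

def greedy_lineage_decomposition(on_states):
    """Group ON states into single-lineage subprograms, deepest ON state first."""
    remaining = set(on_states)
    subprograms = []
    while remaining:
        # Pick the ON state reached after the highest number of inputs.
        deepest = max(sorted(remaining), key=len)
        # Every ON state that is a prefix of it lies on the same lineage.
        covered = {s for s in remaining if deepest[:len(s)] == s}
        subprograms.append(sorted(covered, key=len))
        remaining -= covered
    return subprograms

inputs = ("A", "B", "C")
nodes = lineage_tree_nodes(inputs)
leaves = [n for n in nodes if len(n) == len(inputs)]
print(len(nodes), "nodes,", len(leaves), "complete lineages")

# ON after A, after A then B, and after B then A then C.
on_states = [("A",), ("A", "B"), ("B", "A", "C")]
decomposition = greedy_lineage_decomposition(on_states)
for lineage in decomposition:
    print(lineage)
```

For three inputs the tree has 16 nodes and 3! = 6 complete lineages; the example program needs two subprograms, and therefore two strains, because its ON states fall on two different lineages.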


1. The algorithm first decomposes the lineage tree into subtrees,
each consisting of a single lineage containing one or multiple ON
states. This decomposition is done by iteratively subtracting the
lineages containing ON states (Fig. 5a). To obtain the lowest
number of subprograms, among the lineages with ON states, those
in which the highest number of inputs has occurred are prioritized
(from the right to the left of the lineage tree).


2. After decomposition, for each selected lineage, two pieces of
information are extracted: the identity of the ON states and the
corresponding lineage. Based on these two pieces of information,
the history-dependent logic device is constructed. The identity of
the integrase sites is determined by the lineage and the position of
the gene of interest (GOI) in the modular scaffold that executes the
history-dependent programs occurring within a single lineage. More
details on how the modular scaffold allows the implementation of a
single-lineage program can be found in [23]. The order of occurrence
of inputs corresponding to each lineage is used to identify which
sensor modules are needed among the different connection
possibilities between control signals and integrases.


3. Each device implementing one lineage is implemented in one
strain. By composing these devices, we obtain the global design for
the biological implementation of the desired history-dependent
gene expression program (Fig. 5b). In the web interface, the
biological implementation provided to the user consists of a
graphical representation of the genetic circuit and the device
DNA sequence of each strain (Fig. 5b). This automated design
supports the implementation of all history-dependent programs
with up to five inputs.


The maximum number of strains needed to implement an
N-input/M-output history-dependent gene expression program
is equal to N!, which corresponds to the number of possible
lineages in an N-input lineage tree. However, most functions are
implementable with fewer strains than this maximum, as the number
of strains corresponds to the number of lineages in which gene
expression is required. Importantly, as the system does not use
cell-cell communication, if one of the subprograms is ON, the global
output of the system is considered to be ON.

Fig. 5 Automated design of history-dependent programs using the CALIN web interface. (a) Design algorithm.
The Python program takes as input a history-dependent program written as a lineage tree. This program is
decomposed into subprograms; the decomposition is performed by extracting in priority subprograms with
an ON state at the extremity of the tree (corresponding to the state with a high number of inputs present). For
each subprogram, the algorithm identifies the ON states and the order of the inputs in the
lineage. Based on these two pieces of information, the biological design is obtained, including the graphical
design of the integrase cassette and the history-dependent device, and the corresponding DNA sequence of
the device. The full program design is obtained by composing the designs of each subprogram in different
strains. (b) The CALIN web interface. Following the previously described algorithm, the interface takes as input
the logic program as a lineage tree and gives as output the graphical design and DNA sequence of the device
for each subprogram


2.4 RECOMBINATOR Database for Single-Layer Design

RECOMBINATOR is a database composed of ~19 million devices 
allowing single-cell implementation of all two- and three-input 
logic functions and up to 92% of four-input Boolean logic func- 
tions. This database was generated by combination and permuta- 
tion of recombinase sites, promoters, genes, and terminators 
(Fig. 6) [33]. 

A web interface allows the user to search the database:
http://recombinator.lirmm.fr. The user enters their logic function
of interest, either as a well-formed formula using the logic
operators ".", "+", and "!", or as a binary number giving the
output state for each input state.

[Fig. 6 schematic: user input (logic function, biological constraints) -> database -> user output (list of architectures)]

Fig. 6 RECOMBINATOR database and web interface. The RECOMBINATOR database was generated by
combination and permutation of integrase sites, promoters, terminators, and genes. ~19 million architectures
were obtained, each associated with the logic function they compute. The web interface allows searching in
this database using as input a logic function and providing as an output a list of architectures with their
specifications that can be sorted according to various biological constraints


[Fig. 7 screenshot: table of ten architectures; visible columns include the numbers of promoters, terminators,
asymmetric terminators, and parts, gene at ends, and the cross-promotion constraint]

Fig. 7 Example of a list of architectures generated by the RECOMBINATOR web interface for the logic function
A AND NOT B (A.!B). The screenshot corresponds to the first ten listed architectures without applying any
constraint or sorting criteria. For better visualization, the table has been truncated on the right, showing only
7 of the 12 criteria


Using the same example as previously, to express a gene only if
signal A is present and signal B is absent:


- The well-formed formula is A.!B.


- The binary number is 0010; it can also be written as 0100, as
the logic device design is agnostic to input identity, as
explained previously.
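
The relation between the formula, its binary number, and the P-class can be checked with a short Python sketch. The bit order used here (outputs listed for input states 00, 01, 10, 11, with A as the first input) is an assumption inferred from the example above, and the helper names `truth_string` and `p_class` are hypothetical.

```python
from itertools import permutations, product

def truth_string(func, n_inputs):
    """Binary number of a logic function: one output bit per input state."""
    return "".join(
        str(int(bool(func(*state)))) for state in product((0, 1), repeat=n_inputs)
    )

def p_class(bits, n_inputs):
    """All binary numbers obtained from `bits` by permuting the inputs."""
    states = list(product((0, 1), repeat=n_inputs))
    table = dict(zip(states, bits))
    members = set()
    for perm in permutations(range(n_inputs)):
        # Re-read the truth table with the inputs connected in a different order.
        members.add("".join(table[tuple(state[i] for i in perm)] for state in states))
    return members

a_and_not_b = truth_string(lambda a, b: a and not b, 2)
print(a_and_not_b)               # binary number of A.!B
print(sorted(p_class(a_and_not_b, 2)))
```

Under this convention, A.!B and !A.B fall in the same P-class, which is why either binary number retrieves the same logic devices.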


After submitting the logic function of interest, a table is
generated listing various designs, all theoretically allowing the
implementation of the input logic function, together with their
characteristics (Fig. 7). These designs are called architectures and are
abstracted versions of the final biological devices. Indeed, in an
architecture, the identity (DNA sequence) of each part is not defined;
only the function encoded by each part is.

Each line corresponds to one architecture represented by symbols
(see Table 3 for correspondence between parts and symbols).
The characteristics of each architecture are specified in the table
generated by the web interface (Fig. 7). Each column corresponds
to a particular feature, such as the number of genes, promoters,
terminators, asymmetric terminators, or parts, or whether the gene is
positioned at the extreme segment of the device.

Architectures can be sorted according to each of these criteria. 
It is also possible to filter them by applying some constraints: 
maximum and/or minimum constraints for number of 
parts, lengths, and on/off constraints for the Boolean criteria, 
which are cross-promotion (promoters facing each other) and 
gene at the end. 
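
The sorting and filtering described above can be mimicked offline with a few lines of Python. This is only a sketch: the record fields and the four example architectures are hypothetical (they are not entries exported from the database), and `filter_architectures` is an illustrative helper, not part of the web interface.

```python
def filter_architectures(archs, min_parts=None, max_parts=None, max_length=None,
                         gene_at_ends=None, cross_promotion_respected=None):
    """Keep the architectures that satisfy every constraint that is not None."""
    kept = []
    for a in archs:
        if min_parts is not None and a["n_parts"] < min_parts:
            continue
        if max_parts is not None and a["n_parts"] > max_parts:
            continue
        if max_length is not None and a["length"] > max_length:
            continue
        if gene_at_ends is not None and a["gene_at_ends"] != gene_at_ends:
            continue
        if (cross_promotion_respected is not None
                and a["cross_promotion_respected"] != cross_promotion_respected):
            continue
        kept.append(a)
    return kept

# Hypothetical records, for illustration only.
architectures = [
    {"name": "arch-1", "n_parts": 6, "length": 1200, "gene_at_ends": False,
     "cross_promotion_respected": True},
    {"name": "arch-2", "n_parts": 8, "length": 1500, "gene_at_ends": True,
     "cross_promotion_respected": True},
    {"name": "arch-3", "n_parts": 7, "length": 1100, "gene_at_ends": False,
     "cross_promotion_respected": False},
    {"name": "arch-4", "n_parts": 6, "length": 1000, "gene_at_ends": True,
     "cross_promotion_respected": True},
]

selected = filter_architectures(architectures, max_parts=7,
                                cross_promotion_respected=True)
# Sort by number of parts, then by length, to surface the simplest designs first.
ranked = sorted(selected, key=lambda a: (a["n_parts"], a["length"]))
print([a["name"] for a in ranked])
```

Ranking by part count and length reflects the advice given later in this section to start from the simplest architecture.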

For more details on each architecture, the view button at the
extreme right of each line leads to a new page with all the
characteristics of a specific architecture and the recombination state of
the architecture for each input state (Fig. 8). Additionally, from this
page, logic functions belonging to the same P-class are accessible
with their corresponding architecture. Indeed, to go from the
implementation of one logic function to a logic function belonging
to the same P-class, only the identity of the integrase sites has to be
changed (i.e., in the RECOMBINATOR web interface, the color of
the integrase site pairs).

A lot of information is provided to the user, and for most logic
functions, a large number of architectures are available for their
implementation. Passing from an architecture to a biological
implementation can be challenging and will require optimization;
choosing the simplest architecture and the one most suited to the final
chassis will increase the probability of successful biological
implementation.


Table 3
Symbols used in the RECOMBINATOR web interface to represent each part in the different possible
orientations

Part                                  Symbol
Promoter in forward orientation       (graphical symbol)
Promoter in reverse orientation       (graphical symbol)
Terminator in forward orientation     (graphical symbol)
Terminator in reverse orientation     (graphical symbol)
Gene in forward orientation           G
Gene in reverse orientation           (graphical symbol)
Sites in excision orientation         [ ]
Sites in inversion orientation        ( )


[Fig. 8 screenshot: architecture page for a.!b (binary form 0010), architecture [(G)]. Properties listed:
length 1200 bases; number of genes 1; number of promoters 1; number of terminators 0; number of asymmetric
terminators 0; maximum distance from promoter to gene 80; number of parts 6; gene at ends: no;
cross-promotion constraint: respected]

Fig. 8 Detailed description of the properties of one architecture and its recombination intermediates in the
RECOMBINATOR web interface. Screenshot of the webpage obtained from the view button of the first
architecture in the architecture list for A.!B


3 Conclusion

We presented two design strategies to implement asynchronous
Boolean logic programs in living organisms using serine integrases.
These two design strategies are complementary: one is modular and
scalable but requires a multicellular system, while the other is
ad hoc, and thus can be more complex to engineer, but operates in a
single cell.

The RECOMBINATOR database and the CALIN interface allow the
design of Boolean logic circuits with up to five inputs.
For CALIN, the Python software supports the design of circuits
with more than five inputs and is available on GitHub
(https://github.com/synthetic-biology-group-cbs-montpellier/calin); we
limited the web interface to five inputs to keep the service
responsive.

The automated design of history-dependent programs is only
available with CALIN, which allows the implementation of programs
with up to five inputs and ten outputs. However, the strategy of the
RECOMBINATOR database could be applied to history-dependent
programs by allowing the generation of devices with interlinked
integrase sites whose behavior depends on the order of occurrence of
inputs. The first challenge would be the size of this database, which
would increase significantly.

Of note, we have experimentally validated the CALIN frame- 
work for both Boolean and history-dependent logic [21, 23], while 
the architectures provided by RECOMBINATOR are for now only 
theoretical. The large diversity and peculiarities of some of the 
designs will probably require the user to test several different 
architectures and optimize their behavior on a case-by-case basis. 

We hope that this book chapter will guide synthetic biologists,
as well as scientists from other fields, in choosing the most suitable
design strategy for their specific application and will facilitate the
design of their logic devices using our design web interfaces.




Ana Zuniga et al. 


References 


1. Galanie S, Thodey K, Trenchard IJ et al (2015) Complete biosynthesis of opioids in yeast. Science 349:1095-1100. https://doi.org/10.1126/science.aac9373

2. Paddon CJ, Westfall PJ, Pitera DJ et al (2013) High-level semi-synthetic production of the potent antimalarial artemisinin. Nature 496:528-532. https://doi.org/10.1038/nature12051

3. Isabella VM, Ha BN, Castillo MJ et al (2018) Development of a synthetic live bacterial therapeutic for the human metabolic disease phenylketonuria. Nat Biotechnol 36(9):857-864. https://doi.org/10.1038/nbt.4222

4. Praveschotinunt P, Duraj-Thatte AM, Gelfat I et al (2019) Engineered E. coli Nissle 1917 for the delivery of matrix-tethered therapeutic domains to the gut. Nat Commun 10:5580. https://doi.org/10.1038/s41467-019-13336-6

5. Cui M, Sun T, Li S et al (2021) NIR light-responsive bacteria with live bio-glue coatings for precise colonization in the gut. Cell Rep 36:109690. https://doi.org/10.1016/j.celrep.2021.109690

6. Kalos M, June CH (2013) Adoptive T cell transfer for cancer immunotherapy in the era of synthetic biology. Immunity 39:49-60. https://doi.org/10.1016/j.immuni.2013.07.002

7. Bryksin AV, Brown AC, Baksh MM et al (2014) Learning from nature - novel synthetic biology approaches for biomaterial design. Acta Biomater 10:1761-1769. https://doi.org/10.1016/j.actbio.2014.01.019

8. Kalyoncu E, Ahan RE, Ozcelik CE, Seker UOS (2019) Genetic logic gates enable patterning of amyloid nanofibers. Adv Mater 31(39):e1902888. https://doi.org/10.1002/adma.201902888

9. Tang T-C, Tham E, Liu X et al (2021) Hydrogel-based biocontainment of bacteria for continuous sensing and computation. Nat Chem Biol 17:724-731. https://doi.org/10.1038/s41589-021-00779-6

10. Chang H-J, Voyvodic PL, Zuniga A, Bonnet J (2017) Microbially derived biosensors for diagnosis, monitoring and epidemiology. Microb Biotechnol 10(5):1031-1035. https://doi.org/10.1111/1751-7915.12791

11. Kim SG, Noh MH, Lim HG et al (2018) Molecular parts and genetic circuits for metabolic engineering of microorganisms. FEMS Microbiol Lett 365:fny187. https://doi.org/10.1093/femsle/fny187

12. Pham HL, Wong A, Chua N et al (2017) Engineering a riboswitch-based genetic platform for the self-directed evolution of acid-tolerant phenotypes. Nat Commun 8:411. https://doi.org/10.1038/s41467-017-00511-w

13. Sarpeshkar R (2014) Analog synthetic biology. Philos Trans A Math Phys Eng Sci 372:20130110. https://doi.org/10.1098/rsta.2013.0110

14. Nielsen AAK, Der BS, Shin J et al (2016) Genetic circuit design automation. Science 352(6281):aac7341. https://doi.org/10.1126/science.aac7341

15. Macia J, Manzoni R, Conde N et al (2016) Implementation of complex biological logic circuits using spatially distributed multicellular consortia. PLoS Comput Biol 12:e1004685

16. Gander MW, Vrana JD, Voje WE et al (2017) Digital logic circuits in yeast with CRISPR-dCas9 NOR gates. Nat Commun 8:15459. https://doi.org/10.1038/ncomms15459

17. Anderson DA, Voigt CA (2021) Competitive dCas9 binding as a mechanism for transcriptional control. Mol Syst Biol 17:e10512. https://doi.org/10.15252/msb.202110512

18. Win MN, Smolke CD (2007) A modular and extensible RNA-based gene-regulatory platform for engineering cellular function. Proc Natl Acad Sci U S A 104:14283-14288. https://doi.org/10.1073/pnas.0703961104

19. Green AA, Kim J, Ma D et al (2017) Complex cellular logic computation using ribocomputing devices. Nature 548(7665):117-121. https://doi.org/10.1038/nature23271

20. Bonnet J, Yin P, Ortiz ME et al (2013) Amplifying genetic logic gates. Science 340:599-603. https://doi.org/10.1126/science.1232758

21. Guiziou S, Mayonove P, Bonnet J (2019) Hierarchical composition of reliable recombinase logic devices. Nat Commun 10:456. https://doi.org/10.1038/s41467-019-08391-y

22. Weinberg BH, Pham NTH, Caraballo LD et al (2017) Large-scale design of robust genetic circuits with multiple inputs and outputs for mammalian cells. Nat Biotechnol 35:453-462

23. Zuniga A, Guiziou S, Mayonove P et al (2020) Rational programming of history-dependent logic in cellular populations. Nat Commun 11:4758. https://doi.org/10.1038/s41467-020-18455-z

24. Merrick CA, Zhao J, Rosser SJ (2018) Serine integrases: advancing synthetic biology. ACS Synth Biol 7:299-310. https://doi.org/10.1021/acssynbio.7b00308

25. Yang L, Nielsen AAK, Fernandez-Rodriguez J et al (2014) Permanent genetic memory with >1-byte capacity. Nat Methods 11:1261-1266

26. Fogg PCM, Colloms S, Rosser S et al (2014) New applications for phage integrases. J Mol Biol 426:2703-2716. https://doi.org/10.1016/j.jmb.2014.05.014

27. Guiziou S, Ulliana F, Moreau V et al (2018) An automated design framework for multicellular recombinase logic. ACS Synth Biol 7:1406-1412. https://doi.org/10.1021/acssynbio.8b00016

28. Courbet A, Endy D, Renard E et al (2015) Detection of pathological biomarkers in human clinical samples via amplifying genetic switches and logic gates. Sci Transl Med 7(289):289ra83

29. Byrne KM, Monsefi N, Dawson JC et al (2016) Bistability in the Rac1, PAK, and RhoA signaling network drives actin cytoskeleton dynamics and cell motility switches. Cell Syst 2:38-48. https://doi.org/10.1016/j.cels.2016.01.003


Design of Recombinase Logic Circuits 


30. Harmon B, Chylek LA, Liu Y et al (2017) Timescale separation of positive and negative signaling creates history-dependent responses to IgE receptor stimulation. Sci Rep 7:15586. https://doi.org/10.1038/s41598-017-15568-2

31. Wolf DM, Fontaine-Bodin L, Bischofs I et al (2008) Memory in microbes: quantifying history-dependent behavior in a bacterium. PLoS One 3:e1700. https://doi.org/10.1371/journal.pone.0001700

32. Guiziou S, Chu JC, Nemhauser JL (2021) Decoding and recoding plant development. Plant Physiol 187:515-526. https://doi.org/10.1093/plphys/kiab336

33. Guiziou S, Pérution-Kihli G, Ulliana F, Leclére M (2019) Exploring the design space of recombinase logic circuits. bioRxiv 2019:711374

34. Enderton H, Enderton HB (2001) A mathematical introduction to logic. Academic Press



Designing a Model-Driven Approach Towards Rational 
Experimental Design in Bioprocess Optimization 


Jing Wui Yeoh and Chueh Loo Poh 


Abstract 


To enable a more rational optimization approach to drive the transition from lab-scale to large industrial 
bioprocesses, a systematic framework coupling both experimental design and integrated modeling was 
established to guide the workflow executed from small flask scale to bioreactor scale. The integrated model 
relies on the coupling of biotic cell factory kinetics to the abiotic bioreactor hydrodynamics to offer a 
rational means for an in-depth understanding of two-way spatiotemporal interactions between cell beha- 
viors and environmental variations. This model could serve as a promising tool to inform experimental work 
with reduced efforts via full-factorial in silico predictions. This chapter thus describes the general workflow 
involved in designing and applying this modeling approach to drive the experimental design towards 
rational bioprocess optimization. 


Key words Bioprocess, Cell kinetic model, Computational fluid dynamics, Integrated modeling, 
Vanillin bioproduction 


1 Introduction


To address global sustainability concerns, microbial cells have been
extensively utilized as cell factories to synthesize various valuable
products [1]. However, transitioning from small lab-scale experiments
to large industrial bioprocesses is often challenged by the
non-homogeneous mixing conditions encountered in bioreactors,
which profoundly impact cell growth and bioproduction
performance [2]. Understanding how the cells respond to
environmental variations temporally and spatially could pave the
way towards a more rational optimization of bioprocesses.
Despite previous modeling efforts [3-5], there is a lack of
consensus on established practices to drive experiments
from small-scale to large-scale bioprocesses. To achieve a more
rational approach to fine-tuning bioproduction performance, we have
established a systematic model-driven framework built upon an
integrated model, working in parallel with rational experimental
designs from flask to bioreactor scales, for bioprocess
optimization [6].

Using ferulic acid to vanillin biotransformation as a case study 
[7, 8], an integrated model coupling cell factory kinetics and
3D bioreactor computational fluid dynamics has been developed
to assess the impacts of impeller rotational speed (RPM) and
air supply rate (LPM) on the biomass growth and bioproduction 
performance. These variables are deemed to account for the overall 
aeration and mass transfer rate within the stirred-tank bioreactor 
[9]. This chapter describes the steps and general strategies involved 
in the experimental characterization studies from flask scale to 
bioreactor scale, elucidates the cell kinetic model development at 
different phases and the parallel working with experimental studies 
for validation, and finally outlines the setup of the computational fluid
dynamics (CFD) case and ways to analyze and visualize the results.
This model-driven framework can easily be generalized to other 
bioproduction processes, which enables us to fully harness the 
intuitive and non-intuitive knowledge from experiments and trans- 
late into a quantitative model to be actively used in rational biopro- 
cess optimization across different phases [2]. 


2 Materials and Methods 


This section describes the materials and methods used in
the corresponding subsections. In general, Fig. 1 illustrates a
systematic model-driven framework showing how the in silico modeling
approach works in synergy with experimental studies across different
phases (from small flask scale to larger bioreactor scale) for
rational bioprocess optimization.


2.1 Plasmid Construction


This section briefly describes the general methods involved in plas- 
mid construction such as plasmid design, assembly, and transfor- 
mation and highlights the different parts of the plasmid used in this 
study. 


• Perform all plasmid designs and sequencing analyses using
Benchling designer (Benchling, Inc., San Francisco, CA, USA).


• Obtain the backbone plasmid pBbE8k (JBEI Part ID:
JPUB_000036, colE1 ori, KanR) from Addgene (Addgene,
MA, USA).

• Use the arabinose-induced pBAD promoter with the default ribosome
binding site (rbsD) of strong relative strength to drive the
feruloyl-CoA synthase (Fcs) gene, which encodes the enzyme
that converts the substrate ferulic acid to the intermediate
feruloyl-CoA.

• Use the aTc-induced pTet promoter with BBa_B0034 (rbs34) of
medium relative strength to express the enoyl-CoA hydratase/
aldolase (Ech) gene, which encodes the enzyme that transforms
feruloyl-CoA into the product vanillin.


• Perform Gibson assembly following standard molecular
biology techniques.


• Transform the plasmid into TOP-10 chemically competent
E. coli (Invitrogen) to be used for the bioconversion of ferulic
acid to vanillin.


• Culture the cells in minimal M9 media with 0.2% (w/v) casamino
acids and 0.2% (v/v) glycerol as the sole carbon source.


Fig. 1 A systematic in silico model-driven framework working in synergy with experimental studies from flask
scale to bioreactor scale to enable a rational optimization of bioprocess design. The top panel shows (left)
samples of plasmid and experimental results obtained in flask studies (the performances when adding
different substrate concentrations and the corresponding inhibitory effects) and (right) temporal profiles of
different variables compared with results from cell kinetics and integrated models at specific RPM and LPM
conditions and performances at different combinatorial conditions. The bottom panel illustrates the workflow
of in silico model development starting from (left) developing a cell kinetic model to account for the principal
cell variables, genetic circuit enzyme expression dynamics, and the bioconversion pathway of ferulic acid to
vanillin. (Middle right) This is followed by the development of a bioreactor geometry model for simulation using
computational fluid dynamics. Integrating the cell kinetic model into the bioreactor CFD model allows the
visualization of the temporal profiles and spatial distributions of variables across the entire bioreactor to
examine the mixing effects. Full-factorial simulations under different combinatorial conditions can be
performed to identify the optimal operating condition. (Parts of the figure adopted from [6] with permission)


2.2 Growth Decoupling Strategy

Expression of heterologous enzymes could impose a significant
metabolic burden on cell vitality by redirecting the limited resource
pool away from growth. It is thus important to decouple the cell
growth phase from the expression phase to maximize the biosynthesis
yield without compromising cell viability and productivity. This is
indispensable when dealing with substrates, products, or intermediates
that are toxic to the cells.


• Grow the cells overnight, then inoculate and incubate the cells in
freshly prepared medium for 2-3 h.


• Inoculate the cells into a flask to reach a starting OD600 of 0.1.


• Grow the cells in the control condition to identify the growth
profile over time across different growth phases (lag phase,
exponential log phase, slowdown phase, stationary phase).


• Repeat the process for a new culture with the same starting OD600.


• Grow the cells until reaching the linear exponential growth
phase, which is usually 2-3 h after the start of the
experiment.


• Add the two inducers to the culture to trigger the expression of
the two heterologous enzymes required for the biotransformation
pathway.


• Continue growing the cells until reaching the slowdown
growth phase.


• Add the substrate (ferulic acid in this case) to the culture to start
the bioconversion process.


• Measure the bioproduction yield and productivity at the end of
the experiment and adjust the time points for induction and substrate
addition accordingly to determine the optimal protocol for
maximal productivity and yield.


2.3 Flask Study Characterization

Before implementation in a large-scale bioreactor, experiments can
be conducted at the flask scale to optimize the strain, medium
composition, and bioproduction protocols and duration. More
importantly, these small-scale experiments provide the
preliminary quantitative data required for parameter inference of
the cell mechanistic model, which underpins the more detailed model
required at the larger bioreactor scale.


• Perform experiments on flask scale at varying concentrations to determine the glycerol supply and casamino acid concentration required for optimal cell growth.

• Ensure that the glycerol supply is the limiting factor when cells reach stationary phase and that the other supplements are in abundance, for ease of regulation and modeling.

• Conduct experiments at different inducer (arabinose and aTc) concentrations to identify the concentrations that drive the expression of the two enzymes to achieve the maximal bioproduction performance in the minimal time.


• Characterize the inhibitory impacts on the growth profile when supplemented with different substrate or product concentrations, considering the potential bacteriostatic effects of phenolic compounds (ferulic acid) and aromatic aldehydes (intermediate feruloyl-CoA, vanillin).

• Use 60 ml minimal M9 media in 250 ml flasks and inoculate with 0.6 ml of E. coli in seed medium (overnight culture in Luria broth (LB)) supplemented with an appropriate amount of the antibiotic kanamycin.

• Add the inducers arabinose (0.2%) and aTc (200 nM) simultaneously to the culture at the start of the exponential growth phase (at about 3.5 h in our case).

• Administer the substrate ferulic acid dissolved in dimethyl sulfoxide (DMSO) when the cells begin to enter the slowdown phase (at approximately 5.5 h in our study).

• Collect samples with a sampling volume of 1.5 ml at 2-h intervals.

• Measure the cell optical density (OD) at 600 nm using a spectrophotometer (Eppendorf BioPhotometer Plus).

• Dilute samples with an OD above 2 to obtain a more accurate reading.

• Measure the ferulic acid and vanillin concentrations using HPLC (Shimadzu SPD-M20A Prominence Diode Array Detector) with a mobile phase of 40% methanol and 60% (1% acetic acid).

• Measure the glycerol concentration using HPLC (Agilent Technologies 1260 Infinity Refractive Index Detector) with a mobile phase of 0.005 M H2SO4.

• Perform the experiments in duplicate/triplicate and compute the average and standard deviation values.

2.4 Bioreactor Study Characterization

To better capture the interactions between the biomass growth and the experimental factors, bioreactor experiments are carried out in control (non-induced) and induced states to characterize the biomass growth and biotransformation performance under varying combinations of RPM and LPM, the operating conditions that account for the overall aeration and mass transfer rate within the bioreactor.


• Use a stirred-tank bioreactor with a single wall disk bottom vessel (Winpact Evo Fermentation System FS-07 series, Solid State Fermentation System FS-V-SAO05P) to study the batch culture system for bioproduction.

• Use a 1.5 L fermenter with a working volume of 1 L and a geometry of 10 cm inside diameter and 20 cm height.

• Install two four-blade Rushton-type impellers with a six-hole air sparger above the bottom of the tank.

• Supply the air with an air pump through a 0.22 µm filter.

• Steam-sterilize the bioreactor filled with 1 L of M9 minimal media before running experiments.

• Set the temperature to 37 °C, set the aeration and agitation accordingly, and keep the system running overnight to allow the medium to reach dissolved oxygen (DO) saturation.

• Inoculate the reactor with 20 ml of E. coli in seed medium (from an overnight culture in LB) with an appropriate amount of the antibiotic kanamycin.

• Following the growth decoupling technique, and as in the flask studies, add the two inducers (0.2% arabinose and 200 nM aTc) upon reaching the exponential phase at about 3.5 h to trigger the expression of the enzymes Fcs and Ech.

• Administer the substrate, 0.1% ferulic acid in DMSO, to initiate the biotransformation process and leave it to run for 5 h; the experiment ends at 10.5 h.

• Collect a 5 ml liquid sample at 2-h intervals for measurement.

• Measure the cell OD, glycerol, ferulic acid, and vanillin using the same techniques described for the flask-scale studies.

• Conduct the bioreactor experiments at ten different dual-factor (RPM and LPM) combinatorial conditions as shown in Fig. 2 (0 RPM-0.5 LPM, 0 RPM-3.5 LPM, 100 RPM-0.5 LPM, 150 RPM-0.5 LPM, 225 RPM-0.5 LPM, 400 RPM-0.5 LPM, 225 RPM-0 LPM, 225 RPM-1 LPM, 225 RPM-3.5 LPM, and 400 RPM-3.5 LPM) for model validation while other factors are kept constant.
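
As a cross-check of the experimental design, the ten operating conditions listed above can be written out programmatically. The following is a minimal Python sketch (the study itself used MATLAB); the variable names are illustrative, and the grouping into an RPM sweep, an LPM sweep, and two additional high-aeration conditions is our reading of the list rather than a stated design rule.

```python
# Operating conditions as (RPM, LPM) pairs, copied from the list above.
# The grouping into sweeps is an interpretation, not part of the protocol.
rpm_sweep = [(rpm, 0.5) for rpm in (0, 100, 150, 225, 400)]  # vary RPM at 0.5 LPM
lpm_sweep = [(225, lpm) for lpm in (0, 0.5, 1, 3.5)]         # vary LPM at 225 RPM
extra = [(0, 3.5), (400, 3.5)]                               # remaining two conditions

# Merge the groups, dropping the duplicate (225 RPM, 0.5 LPM) while keeping order.
conditions = list(dict.fromkeys(rpm_sweep + lpm_sweep + extra))

for rpm, lpm in conditions:
    print(f"{rpm:>3} RPM - {lpm} LPM")
```

Keeping the conditions in a single list of this kind makes it straightforward to loop over the same set when parameterizing or validating the model.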


Fig. 2 A schematic diagram showing the different RPM and LPM combinatorial operating conditions conducted experimentally, chosen rationally based on the sensitive regions from dose-response curves under different metrics (peak OD and productivity)


2.5 Cell Factory Kinetic Modeling

This section outlines the general workflow and techniques used in developing the cell factory kinetic model. More detailed model development and validation at flask scale and bioreactor scale are elaborated in the subsequent sections.

• The model simulations were implemented in MATLAB R2018b (MathWorks) for our study.

• Derive the kinetic model formulation in the form of ordinary differential equations (ODEs) to describe the different cellular or environmental variables.

• Solve the ODEs using numerical methods such as the forward Euler approximation or the MATLAB built-in function ode15s. For dynamic profiles with fluctuations, the forward Euler approximation seems to provide more stable and accurate results after tuning the time step, whereas ode15s provides higher computational speed owing to its adaptive variable-step, variable-order scheme but might not accurately represent results simulated in a dynamic manner.

• Apply global and/or local optimizers for parameter estimation when comparing model simulations against experimental data points.

• In our case, the function fminsearchbnd, a bound-constrained local optimization algorithm based on the Nelder-Mead simplex search method, was used to perform the parameter estimation given initial guesses and lower and upper bounds. “None” can be used for those parameters without known boundaries.

• The full cell factory model encompasses the descriptions of biomass growth, nutrient consumption, dissolved oxygen dynamics, heterologous gene circuit enzyme expression, and the enzyme catalytic biotransformation pathway.

2.6 Flask-Scale Model Development and Validation

A primary step towards developing the integrated modeling framework is the development of a preliminary cell mechanistic model that can capture the phenomena observed in small flask-scale experimental studies. This enables one to quickly arrive at a coarse-grained yet informative model that quantitatively captures the essential components and can serve to optimize the experimental designs at the early phase. This section outlines the steps involved in the early model development and validation at flask scale.


• Develop a simple biophysical kinetic model as illustrated in Fig. 3 to capture the cell growth profile, genetic circuit enzyme expression (enzymes Fcs and Ech in our case), and the biotransformation pathway (from ferulic acid to vanillin formation).

Fig. 3 A preliminary cell mechanistic model consisting of a simple growth model, genetic circuit model, and pathway-level model developed using flask-scale experimental data, which forms the basis for the more detailed model at bioreactor scale


• A simple Verhulst growth model, which is a logistic growth formalism, was adopted in this study to describe the sigmoidal growth profile of a batch culture comprising three/four phases (a short initial lag phase, exponential log phase, and slowdown phase followed by stationary phase), irrespective of the actual constraining factors that define the carrying capacity of the cells.

• Describe the inducible enzyme expression of the genetic circuit by a system of nonlinear ODEs, where the rates of change in mRNAs and proteins are defined by applying the law of mass balance and the Hill equation is used to describe the transcriptional control by inducers.

• Apply the Michaelis-Menten equation to model the catalytic biotransformation of ferulic acid into the intermediate feruloyl-CoA, finally leading to vanillin formation, involving the enzymes Fcs and Ech.

• It is important to include the ratio of the molar masses of the substrate, intermediate, and product in the equation to account for the difference in the measured concentration unit (molar).

• Fit the model to the measured experimental data from flask scale to infer the various kinetic parameters, which can be used to determine the optimal point of induction and the duration of the biotransformation run.

• Determine the dose-response parameters using other flask-scale experiments, such as examining growth profiles under varying glycerol concentrations and the inhibitory impacts of substrate or product on cell growth. The dose-response formulations and parameters derived from these data lay the groundwork for the model at bioreactor scale.

• This model will form the basis for scaling up the cell model to incorporate other environmental factors controlled in the bioreactor setting.
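
The structure of such a flask-scale model can be illustrated with a short forward Euler simulation. The sketch below is written in Python rather than MATLAB and is not the model used in our study: every parameter value is a placeholder, both enzymes are assumed to share the same expression parameters, and growth is assumed to be unaffected by enzyme expression or by substrate and product toxicity. Only the event times (induction at 3.5 h, substrate addition at 5.5 h, end of run at 10.5 h), the starting OD600 of 0.1, and the 0.1% ferulic acid dose are taken from the protocol.

```python
# Forward-Euler sketch of a flask-scale cell model. All parameter values are
# placeholders chosen for illustration; they are not fitted values.
MU, K_CAP, X0 = 1.0, 3.0, 0.1              # growth rate (1/h), carrying capacity and start (OD600)
T_IND, T_SUB, T_END = 3.5, 5.5, 10.5       # induction, substrate addition, end of run (h)
INDUCER, K_HILL, N_HILL = 1.0, 0.5, 2.0    # normalized inducer dose and Hill constants
ALPHA, D_M = 5.0, 2.0                      # maximal transcription and mRNA decay rates (1/h)
BETA, D_P = 2.0, 0.2                       # translation and protein decay rates (1/h)
KCAT, KM = (2.0, 2.0), (1.0, 1.0)          # Michaelis-Menten constants for (Fcs, Ech)
MW_FA, MW_VAN = 194.18, 152.15             # molar masses (g/mol): ferulic acid, vanillin
FA0_MM = 1.0 / MW_FA * 1000.0              # 0.1% (1 g/L) ferulic acid expressed in mM


def hill(dose, k, n):
    """Fractional promoter activity for a given inducer dose."""
    return dose**n / (k**n + dose**n)


def simulate(dt=0.001):
    """Integrate growth, enzyme expression, and the two-step pathway with forward Euler."""
    x = X0
    m, p = [0.0, 0.0], [0.0, 0.0]          # mRNA and protein levels for (Fcs, Ech)
    fa = fcoa = van = 0.0                  # pathway species in mM
    substrate_added = False
    for i in range(round(T_END / dt)):
        t = i * dt
        if not substrate_added and t >= T_SUB:
            fa, substrate_added = FA0_MM, True
        act = hill(INDUCER, K_HILL, N_HILL) if t >= T_IND else 0.0
        dx = MU * x * (1.0 - x / K_CAP)                       # logistic (Verhulst) growth
        dm = [ALPHA * act - D_M * mi for mi in m]             # mRNA mass balance
        dp = [BETA * mi - D_P * pi for mi, pi in zip(m, p)]   # protein mass balance
        v1 = KCAT[0] * p[0] * fa / (KM[0] + fa)               # ferulic acid -> feruloyl-CoA
        v2 = KCAT[1] * p[1] * fcoa / (KM[1] + fcoa)           # feruloyl-CoA -> vanillin
        x += dt * dx
        m = [mi + dt * d for mi, d in zip(m, dm)]
        p = [pi + dt * d for pi, d in zip(p, dp)]
        fa, fcoa, van = fa - dt * v1, fcoa + dt * (v1 - v2), van + dt * v2
    return {"OD": x, "FA_mM": fa, "FCoA_mM": fcoa, "Van_mM": van,
            "Van_g_per_L": van * MW_VAN / 1000.0}


result = simulate()
print(result)
```

Working in molar units keeps the pathway mass balance simple (the three species always sum to the ferulic acid added), and the molar mass is applied only when converting to the g/L values measured by HPLC. Halving `dt` and confirming that the output barely changes is a quick way to check that the Euler time step is small enough.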


2.7 Bioreactor-Scale Cell Model Development and Validation

Moving towards the bioreactor scale, it is essential to account for the relevant external environmental variations and the impacts imposed on the cell behaviors and bioproduction performance. This section highlights the strategies involved in developing the two-way interaction detailed model as demonstrated in Fig. 4, supported by rational experimental designs for model validation.

• To better account for the different external environmental fluctuations observed in the bioreactor, we develop a more detailed model considering those impacts, starting from the nutrient consumption and dissolved oxygen level at bioreactor scale.

Fig. 4 More detailed cell model capturing the interactions with bioreactor environmental factors such as nutrient consumption, dissolved oxygen level under influences of varying RPM and LPM, and inhibitory effects imposed by toxic substrate and product

• Apply the Monod-based formalisms to describe these two key environmental factors: glycerol as carbon source and dissolved oxygen dynamics.

• Validate the developed model with the measured experimental data collected from the batch culture of the 1 L bioreactor study conducted in parallel under a specific operational setting, which is taken as the control condition without induction.

• To capture the two-way interactions, which account for the impact of these fluctuating environmental conditions on the biomass growth, the earlier logistic-based model of biomass was modified to accommodate the nutrient- and oxygen-dependent aerobic respiration.

• In view of the facultative anaerobic nature of E. coli, anaerobic respiration can also be incorporated into the biomass equation to account for conditions of low oxygen supply.

• It is also important to consider the inhibitory effects on biomass growth imposed by both the substrate (ferulic acid) and the product (vanillin) by incorporating the dose-response formulations and parameters derived from flask-scale experiments.

• To mimic the non-homogeneous condition due to mixing, in this study we focus on varying the two critical bioreactor parameters, the impeller stirring speed (RPM) and the air flow rate (LPM), which are deemed to account for the overall aeration and mass transfer rate that have profound impacts on cell growth and bioproduction performance.

• To examine the combined impacts of these two parameters, different combinatorial experiments can be performed under variations of the two determinants.

• To study the effect of LPM on biomass growth and biotransformation performance, fix the RPM in the middle of the range, such as at 225, and then carry out bioreactor experiments with induction across four different LPM values (0, 0.5, 1, 3.5).

• The metrics of biomass growth (peak OD) and biotransformation performance (productivity, yield) can then be computed from the experimental measurements.

• Calculate the percent yield of the product vanillin as the ratio of the actual measured yield to the theoretical yield, expressed as a percentage.

• Compute the productivity as the gradient of product formation over the biotransformation duration.

• From the dose-response curve, identify the most sensitive LPM, that is, the value that renders the most prominent change in the biomass and biotransformation metrics (0.5 LPM in our case).
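
The yield and productivity metrics described above can be computed as in the following Python sketch. It rests on two assumptions that the text does not state explicitly: the theoretical yield is taken as complete 1:1 molar conversion of the ferulic acid fed, and the "gradient" is taken as the least-squares slope of vanillin concentration against time. The function names and the example measurements are hypothetical.

```python
MW_FERULIC_ACID = 194.18  # g/mol
MW_VANILLIN = 152.15      # g/mol


def percent_yield(vanillin_g_per_l, ferulic_acid_fed_g_per_l):
    """Measured vanillin relative to the theoretical maximum (assumed 1:1 molar conversion)."""
    theoretical = ferulic_acid_fed_g_per_l * MW_VANILLIN / MW_FERULIC_ACID
    return 100.0 * vanillin_g_per_l / theoretical


def productivity(times_h, vanillin_g_per_l):
    """Least-squares slope of product concentration versus time, in g/L/h."""
    n = len(times_h)
    t_mean = sum(times_h) / n
    c_mean = sum(vanillin_g_per_l) / n
    num = sum((t - t_mean) * (c - c_mean) for t, c in zip(times_h, vanillin_g_per_l))
    den = sum((t - t_mean) ** 2 for t in times_h)
    return num / den


# Hypothetical measurements after substrate addition at 5.5 h (not data from the study).
times = [5.5, 7.5, 9.5, 10.5]
vanillin = [0.0, 0.15, 0.31, 0.38]
print(f"yield: {percent_yield(vanillin[-1], 1.0):.1f} %")
print(f"productivity: {productivity(times, vanillin):.3f} g/L/h")
```

If productivity is instead defined from the end points only, the slope can be replaced by the final concentration divided by the biotransformation duration; the two definitions agree when product formation is linear in time.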


2.8 CFD Simulation 
Setup and Integrated 
Modeling 
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e To examine the effect of RPM, fix the LPM at the most sensitive 


value (0.5 LPM) and then perform bioreactor experiments with 
induction over different RPMs (0, 100, 150, 225, 400). 


e Fit the dose-response behaviors using the Hill equation to 
analyze the impacts and sensitivities of the two determinants 
on the biomass growth and biotransformation performance. 


e Plot the productivity metric against the peak OD to identify 
their correlation behavior; a nonlinear relation has been 
observed in our study. 


¢ To better elucidate the combined influences of RPM and LPM 
on the biomass growth and biotransformation performance, we 
incorporate their compounded impact on determining the oxy- 
gen mass transfer coefficient and formulate a semi-empirical 
equation to account for the growth-associated bioproduction 
performance featuring the nonlinear relation. 


e Parameterize the developed model based on the time-response 
profiles of different cellular or environmental variables obtained 
from the bioreactor runs subjected to different RPM and LPM 
combinatorial operating conditions. 


¢ Deploy the model with the inferred parameters to predict the 
cell behaviors and performance for the full-factorial operating 
design spaces. 


e¢ This model-driven approach can be used to identify the most 
appropriate or optimal operating condition tailored to the 
desired goals (e.g., high productivity, high biomass growth, 
minimal operating cost, etc.). 


Non-ideal or suboptimal mixing conditions could lead to non-uniform spatial distributions of key variables such as nutrients, dissolved oxygen, and the administered substrate required for proper bioconversion and optimal cell growth. To demonstrate the spatial variations at different operating conditions, we integrate the developed cell mechanistic model with a fluid dynamic model to provide a detailed description of the local flow-field dynamics across the entire space of the bioreactor, as shown in Fig. 5. This section outlines the steps involved in preparing the bioreactor geometry, integrating the cell model as transport functions, and setting up the case for simulation.


• Fluid dynamic simulation was executed using ANSYS® Academic CFX Release 19.1.

• Draw the geometry of the bioreactor using SolidWorks (a computer-aided design (CAD) and analysis tool), ANSYS DesignModeler, or SpaceClaim.
Fig. 5 Integrated model development after coupling the cell kinetic model to the bioreactor fluid dynamic model, which enables one to visualize the spatial distribution profiles of different cellular variables across the whole bioreactor. Top panel: Detailed cell mechanistic model. Bottom panel: Bioreactor geometry, generated mesh for simulation, and fluid flow vector field. (Parts of the figure adapted from [6] with permission)


• Create the three parts/domains of the bioreactor for fluid dynamic simulation: the main bioreactor domain, a rotating domain, and an injection domain.

• Assemble the parts together by imposing the necessary constraints, with the rotating domain defined by the moving reference frame methodology.

• Generate the tetrahedral mesh for the geometry required for simulation, in which the element count is approximately five times larger than the number of nodes.

• Proceed to configure the simulation settings such as Analysis Type and Solver.

• Choose Transient Analysis to run the time-response study and set the Total Time Duration, the respective Time Steps, and the Initial Time point.

• Set the proper Solution Units ([kg] for Mass Units, [m] for Length Units, and [s] for Time Units in our case), as 1 g/L is equivalent to 1 kg/m³.

• Choose the second-order backward Euler option under Solver Control.
• Under the Results tab of Output Control, choose the Mass Fraction for each of the variables and other components such as Shear Strain Rate or Total Pressure.

• Under the Trn Results tab of Output Control, add Transient Results and set the Time Interval (0.1 s in our case).

• Insert the different cell variables (cell biomass, oxygen, carbon dioxide, mRNA and protein for the two enzymes Fcs and Ech, glycerol, M9 medium, ferulic acid, feruloyl-CoA, and vanillin) as new materials of type Pure Substance, and set their Thermodynamic State under Basic Settings and their Molar Mass and Density under Material Properties.

• Include an additional material as Variable Composition Mixture, select all the previously defined materials under Materials List, and use Liquid as the Thermodynamic State.

• Insert all the cell model formulations into the Expressions after converting to mass fraction by dividing by the total density.

• Include additional expressions to calculate the average mass fraction of each component over the different domains (the injection domain may be excluded, as it occupies only a small volume that is negligible compared to the other domains).

• Insert an expression to represent the cell inoculation at the top of the bioreactor (the inlet boundary of the injection domain) using a step function with a mass flow rate of 0.1 g/s for a 2 s injection duration.

• Under the Flow Analysis, create a new boundary condition as Inlet, set the Boundary Type to Inlet, and choose the Location to be the top surface of the injection domain.

• Under the Boundary Details, choose Mass Flow Rate for Mass and Momentum, insert the variable name of the cell injection expression (the step function mentioned earlier) as the Mass Flow Rate, and choose Normal to Boundary Condition as the Flow Direction.

• For the Component Details, select the cell biomass variable and set the Mass Fraction to 0.1 for the cell inoculation/injection process, whereas the other variables are set to zero.

• Under the Flow Analysis for the other domains, choose their respective locations, set them to Fluid Domain, add a new Fluid Definition with mix as the Material, and set Continuous Fluid under Morphology and a Reference Pressure of 1 atm with the Non-Buoyant Model.

• For the rotating domain, set the rotational speed by choosing Rotating under Domain Motion, entering the respective revolutions per minute (100 RPM, 150 RPM, 225 RPM, 400 RPM for the different biotransformation run cases) under Angular Velocity, and setting the Rotation Axis (Global Y in our case).
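
For a given rotational speed, the flow regime in a stirred tank is commonly judged from the impeller Reynolds number, Re = ρND²/μ, where N is the rotational speed in revolutions per second and D is the impeller diameter. The Python sketch below evaluates this expression for the speeds listed above. The fluid properties (water at about 37 °C) and the 5 cm impeller diameter are assumptions made for illustration only; neither is specified in this protocol, so the printed values should not be read as the Reynolds numbers of our system.

```python
def impeller_reynolds(rpm, impeller_diameter_m, density_kg_m3, viscosity_pa_s):
    """Impeller Reynolds number Re = rho * N * D^2 / mu, with N in revolutions per second."""
    n_rev_per_s = rpm / 60.0
    return density_kg_m3 * n_rev_per_s * impeller_diameter_m**2 / viscosity_pa_s


# Assumed values for illustration: water-like medium at about 37 degC and a 5 cm impeller.
# Neither the impeller diameter nor the fluid properties are given in the text.
RHO = 993.0      # kg/m^3
MU = 6.9e-4      # Pa*s
D_IMP = 0.05     # m

for rpm in (100, 150, 225, 400):
    re = impeller_reynolds(rpm, D_IMP, RHO, MU)
    print(f"{rpm} RPM: Re = {re:.0f}")
```

Because Re scales with the square of the impeller diameter, the measured diameter of the actual impeller should be substituted before using the result to choose a turbulence model.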
• For the Fluid Models tab under Flow Analysis, set None (Laminar) under Turbulence for angular velocities lower than 200 RPM and choose k-Epsilon for the other velocities. This choice is based on the Reynolds number (Re) computed for the stirred-tank bioreactor from the fluid density, the diameter of the impeller, the dynamic viscosity of the fluid, and the rotational speed; the system is deemed to be fully turbulent for Re > 10,000.

• Choose Transport Equation for all the components under Component Models at the Fluid Models tab.

• For the Initialization tab, set the initial values for glycerol and oxygen under Automatic with Value Option for simulations in which cells are inoculated from the top of the bioreactor.

• Set the initial values for cell biomass as well when running simulations of the full bioconversion process.

• After setting up the case under Setup, proceed to the Solution section to define the Run Mode (Serial or Parallel) and choose Current Solution Data (if possible) for the Initialization Option.

• Save the case file and submit the definition (.def) file to be run on a server or workstation, as the simulation is computationally intensive and time-consuming. It is recommended to test-run the case for a short period of time to ensure that the case is set up properly without any unintended errors.

2.9 CFD Results Visualization

This section highlights the procedures involved in analyzing and 
visualizing the results (time response, spatial distribution, and ani- 
mation video) obtained from the CFD simulations. 


• Once the simulations have completed, a result (.res) file and a folder containing the details will be generated.

• The .res file can be loaded into CFD-Post for visualizing the simulation results.

• Create a plane under the Location dropdown menu and choose the XY Plane, which represents the middle plane of the bioreactor in vertical view.

• Create a contour plot or vector plot to view the concentration spatial distribution profiles or the velocity vector field and identify the different flow patterns.

• Choose the created plane as Locations, choose the corresponding mass fraction as Variable, and set the Range to Local.

• Set the other features, such as the number of Contours for resolution or the settings under Labels and Render, based on preferences.

• Use Timestep Selector to view the spatial distribution profile across the plane at a specific time point.

• To generate a video of the changes in spatial distributions over time, use Animation and choose Timestep Animation, set Current Timestep to the starting time point, click on Save Movie and select the path to save the video, and Play the Animation to initiate the process.

• To view the time series data, click on the chart, choose the plot Type (XY - Transient or Sequence) under the Data Series tab, select Expression, and choose the corresponding variable that represents the computed average value for the different species from the dropdown menu.

• Export the data as a CSV file for external plotting.
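
For external plotting, the exported time series can be read with a few lines of Python. The sketch below assumes that the file consists of a header row followed by numeric rows and that any extra lines (for example, section markers such as [Name] or [Data]) can be skipped; the exact layout should be checked against the actual export. The function name, column names, and sample values are hypothetical.

```python
import csv
import io


def read_series(text):
    """Return (header, rows) from CSV text, skipping blank and single-field lines."""
    header, rows = None, []
    for fields in csv.reader(io.StringIO(text)):
        fields = [f.strip() for f in fields]
        if len(fields) < 2:
            continue  # blank lines or section markers
        try:
            rows.append([float(f) for f in fields])
        except ValueError:
            if not rows:
                header = fields  # last non-numeric multi-column line before the data
    return header, rows


# Hypothetical export content; column names and values are made up for illustration.
sample = """[Name]
Series 1

[Data]
Time [ s ], avgVanillin
0.0, 0.0
0.1, 1.5e-06
0.2, 3.1e-06
"""
header, rows = read_series(sample)
times = [r[0] for r in rows]
values = [r[1] for r in rows]
print(header, len(rows))
```

For a file on disk, the same function can be applied to the text returned by reading the file, and the resulting `times` and `values` lists can be passed to any plotting tool.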
Modeling Subcellular Protein Recruitment Dynamics 
for Synthetic Biology 


Kwabena A. Badu-Nkansah, Diana Sernas, Dean E. Natwick, 
and Sean R. Collins 


Abstract 


Compartmentalized protein recruitment is a fundamental feature of signal transduction. Accordingly, the 
cell cortex is a primary site of signaling supported by the recruitment of protein regulators to the plasma 
membrane. Recent emergence of optogenetic strategies designed to control localized protein recruitment 
has offered valuable toolsets for investigating spatiotemporal dynamics of associated signaling mechanisms. 
However, determining proper recruitment parameters is important for optimizing synthetic control. In this 
chapter, we describe a stepwise process for building linear differential equation models that characterize the 
kinetics and spatial distribution of optogenetic protein recruitment to the plasma membrane. Specifically, 
we outline how to construct (1) ordinary differential equations that capture the kinetics, efficiency, and 
magnitude of recruitment and (2) partial differential equations that model spatial recruitment dynamics and 
diffusion. Additionally, we explore how these models can be used to evaluate the overall system perfor- 
mance and determine how component parameters can be tuned to optimize synthetic recruitment. 


Key words Mathematical modeling, Signal transduction, Protein dynamics, Diffusion, Optogenetics, 
Plasma membrane, Localization, Compartmentalization, Protein recruitment, iLID 


1 Introduction 


The cellular cortex is a primary site of signaling where dynamic 
protein and lipid scaffolds guide signaling networks to control 
essential cellular behaviors [1]. Signal processing at the plasma 
membrane occurs through multiple classes of mechanisms 
including local modification of cortical proteins, creation of lipid 
subdomains that directly recruit protein effectors, and activation of 
scaffold proteins that promote signal complex formation. In many 
cases, these processes can be hijacked by controlling the localization 
of specific pathway components. As a result, a number of engineered
strategies that mimic primary modes of protein recruitment
to the plasma membrane have emerged as complementary toolsets
in synthetic biology for investigating compartmentalized dynamics
of signal transduction (Fig. 1).

Fig. 1 An assortment of signaling mechanisms for plasma membrane recruitment coupled with example design principles of associated optogenetic systems. (a) Left, cortical protein recruitment by receptor activation and clustering. Right, synthetic activation driven by CRY2 optogenetic receptor clustering. (b) Left, direct associations between plasma membrane lipid domains and lipid binding proteins. Right, local protein recruitment to plasma membrane domains after synthetic enrichment of signaling lipids using iLID optogenetic recruitment of a lipid-modifying enzyme. (c) Left, signaling complex formation downstream of an activated receptor. Right, synthetic production of signaling complexes through direct stimulation of optogenetic opsin receptors

Engineered control of protein localization typically uses chemical
[2-5] and/or light-inducible [6-8] strategies. In general, these
tools rely on tagging a target signal regulator and its binding
partner(s) separately with genetically encoded affinity domains
whose association requires exogenous activation. These approaches
can also be adapted to locally recruit constitutively active regulators
to cellular compartments of interest that house important effectors
[6]. Accordingly, exogenous recruitment of target proteins to the 
cell cortex can be achieved by anchoring one dimerization compo- 
nent to the inner leaflet of the plasma membrane. This strategy has 
been employed to selectively activate and recruit Rho GTPases 
[9, 10], control spindle positioning [11], investigate lipid regula- 
tion of ion channels [12], and decipher actin-mediated phosphoi- 
nositide 3-kinase feedback during cell polarization [13, 14]. In 
addition to activating downstream signaling, synthetic recruitment 
strategies have also been used for inhibitory roles by sequestering 
protein regulators away from their signaling niches [15]. The 
increasing diversity of synthetic strategies for protein recruitment 
and control has immense potential for elucidating complex signal- 
ing networks. However, these systems are built from biochemical 
components bound by the laws of chemistry and physics. Often 
their behaviors in real cells do not match the cartoon models that 
we draw based on their design, and system responses can be variable 
from cell to cell and from day to day. Computational methods 
provide a natural complement for these approaches by assessing 
how component features can be tuned to elicit desired dynamics. 
When component parameters are known or can be empirically 
estimated, mathematical models can become powerful tools that 
offer predictability and insights into how biochemical and physical 
constraints affect system performance. They can be particularly 
useful for characterizing the kinetics and spatial patterns of compo- 
nent outputs after compartmentalized recruitment. 

Here, we describe a stepwise approach to construct and apply 
mathematical models that characterize the kinetics and spatial dis- 
tribution of protein recruitment [7]. We describe the construction 
of a system of ordinary differential equations (ODEs) to analyze the 
temporal dynamics of protein recruitment and partial differential 
equations (PDEs) that incorporate spatial patterns. Such 
ODE/PDE models have been useful in profiling 
membrane-associated processes including EGF receptor-mediated 
MAPK signaling [16], optogenetic membrane anchoring [7], sig- 
nal transmission from compartmentalized Ras GTPase nanoclusters 
[17], and membrane-associated Rho GTPase cycling [18-20]. To 
illustrate this approach, we specifically focus on a two-component 
ODE model that encompasses the dynamics of local recruitment of 
a protein species to the plasma membrane. We also derive an 
associated PDE model that incorporates spatial conditions, symme- 
try features, and the effect of diffusion on spatial association pat- 
terns. We describe how to compute these models using MATLAB; 
however, similar computational strategies can be implemented 
using other programming languages. 
2 Materials

2.1 Personal Computer

A programming platform such as MATLAB equipped with algorithmic solvers for systems of ODEs and/or 1D PDEs.

3 Methods

3.1 Modeling Recruitment Kinetics and Endpoint Dynamics


A key challenge in designing synthetic recruitment systems is 
achieving high levels of stimulus-induced responses with low basal 
recruitment. In general, the rate of membrane recruitment and the 
rate of dissociation are critical parameters that need to be optimized 
for this goal. We recently generated models to explore these fea- 
tures for a popular optogenetic approach, the improved light- 
induced dimerization (iLID) system [7]. iLID is an engineered 
protein containing a modified LOV2 domain that, in response to 
light, exposes a peptide from E. coli SsrA capable of binding with 
high affinity to a partner SspB fusion protein (Fig. 2a) [6]. By 
anchoring iLID to the plasma membrane, this system can be used 
to concentrate target proteins of interest to local membrane sites in 
response to blue light exposure. In addition to intrinsic features 
that control component binding, including component conforma- 
tion dynamics and binding specifications of the SsrA peptide and 
SspB, recruitment performance of iLID depends on extrinsic vari- 
ables, such as component concentrations and compartmental 
anchoring, that often require empirical optimization by the user. 
However, in silico approaches can be useful for identifying para- 
meters that help guide recruitment optimization. 

To predict the behavior of such a system, we can construct 
ODE models to identify expression regimes of component species 
for which membrane recruitment is specific and efficient. Our 
simple model contains two protein species where a substrate (S) 
concentrates to the plasma membrane upon activation of the 
recruiting receptor species (R) (Fig. 2b). We consider the [iLID]
and [SspB] components as representations of [R] and [S], respectively,
but a structurally identical model can also be used to describe
other optogenetic approaches or, alternatively, simple systems of
localized protein recruitment. In this model, R is bistable; it can
exist either in an inactive state with low affinity for substrate S or in a
high-affinity active state, R* (Fig. 2b). It is critical to consider
binding for both states of R, as the basal binding of S to inactive
R can be a key limitation of recruitment systems at high component
concentrations.

Here, we describe how to outline the primary states of the 
system and build an ODE model that captures the kinetics of 
protein recruitment. After defining component species, the inter- 
action states, reaction events, and initial conditions prior to recep- 
tor activation can be determined: 
Fig. 2 Schematic diagrams of iLID recruitment and component interaction states (panel title: Receptor-Mediated
Membrane Recruitment). (a) Diagram illustrating idealized iLID and SspB interactions before and after light
activation. (b) Schematic diagram depicting possible activation states and interaction events during membrane
recruitment of substrate S by receptor R, including “dark state binding” in which the substrate binds to an
inactive membrane receptor


1. Define the molecular components of the system, and systemat- 
ically determine each state of the system and each possible 
transition between states. For our model, this corresponds to 
the diagram in Fig. 2b. 


2. Assign variables to the protein species and component states 
involved in recruitment. Be sure to have a variable for every 
molecular species in the model: 
Protein Species:
R = Free inactive receptor
R* = Free active receptor
S = Free substrate
RS = Substrate-bound inactive receptor
R*S = Substrate-bound active receptor


3. Write chemical equations for each reaction in the model, 
including component interactions and transitions between pro- 
tein states. We assume that receptor activation occurs in 
response to the experimental stimulation with a single rate 
that is equal for all binding states of the receptor. For this 
example, the receptor activation rate, γ, will be nonzero during
stimulation and zero otherwise: 


Interaction Events
R + S ⇌ RS;
Forward Rate = Rate_Inactive,Binding;
Reverse Rate = Rate_Inactive,Release
R* + S ⇌ R*S;
Forward Rate = Rate_Active,Binding;
Reverse Rate = Rate_Active,Release
Receptor Activation Events
R ⇌ R*;
Forward Rate = γ;
Reverse Rate = Rate_Rev
RS ⇌ R*S;
Forward Rate = γ;
Reverse Rate = Rate_Rev


4. Define reaction constants. We can define the rate constants 
numerically using estimates based on published measurements. 
In many cases, the binding affinities (K_d) may be available, but
the forward and reverse binding rates may not be. In this case,
we relate both rates to the K_d and estimate the off-rate using
published kinetic data or by calibrating the model to empirical 
measurements (see Note 1): 


$$K_{d,\mathrm{inactive}} = \frac{\mathrm{Rate}_{\mathrm{Inactive,Release}}}{\mathrm{Rate}_{\mathrm{Inactive,Binding}}}$$

We will treat the forward rate of receptor activation (γ) as
an experimental input into the model, but the associated 
reverse rate can likely be empirically obtained or estimated 
from observations made in prior literature. 


5. Define mathematical versions of the rate equations for each 
species. Construct one differential equation for each species 
in the model (see Note 2). To simulate receptor activation, 
the receptor activation term, γ_input, will depend on the external
input at a given time. For iLID systems, γ_input is the temporal
profile of blue light irradiation: 


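$$\frac{d[R^*]}{dt} = \gamma_{\mathrm{input}}[R] + \mathrm{Rate}_{\mathrm{Active,Release}}[R^*S] - \mathrm{Rate}_{\mathrm{Active,Binding}}[R^*][S] - \mathrm{Rate}_{\mathrm{Rev}}[R^*] \tag{1}$$

$$\frac{d[R^*S]}{dt} = \gamma_{\mathrm{input}}[RS] + \mathrm{Rate}_{\mathrm{Active,Binding}}[R^*][S] - \mathrm{Rate}_{\mathrm{Active,Release}}[R^*S] - \mathrm{Rate}_{\mathrm{Rev}}[R^*S] \tag{2}$$

$$\frac{d[R]}{dt} = \mathrm{Rate}_{\mathrm{Inactive,Release}}[RS] + \mathrm{Rate}_{\mathrm{Rev}}[R^*] - \mathrm{Rate}_{\mathrm{Inactive,Binding}}[R][S] - \gamma_{\mathrm{input}}[R] \tag{3}$$

$$\frac{d[RS]}{dt} = \mathrm{Rate}_{\mathrm{Inactive,Binding}}[R][S] + \mathrm{Rate}_{\mathrm{Rev}}[R^*S] - \mathrm{Rate}_{\mathrm{Inactive,Release}}[RS] - \gamma_{\mathrm{input}}[RS] \tag{4}$$

$$\frac{d[S]}{dt} = \mathrm{Rate}_{\mathrm{Inactive,Release}}[RS] + \mathrm{Rate}_{\mathrm{Active,Release}}[R^*S] - \mathrm{Rate}_{\mathrm{Inactive,Binding}}[R][S] - \mathrm{Rate}_{\mathrm{Active,Binding}}[R^*][S] \tag{5}$$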


6. Define the initial state of the system. Prior to stimulation, we 
assume that the receptor is entirely inactive and the system is in 
steady state. Therefore, each reaction species can be defined 
using known measurements. At this initial state, variables
[R]_total and [S]_total are defined to be constants representing
the total concentration of the two proteins. Additionally, conservation
of mass can be used to relate free [R] and [S] to the
amount of complexed [RS]:


[R*] = 0;
[R*S] = 0;
[R] = [R]_total - [RS];
[S] = [S]_total - [RS];
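

7. Compute the concentration of RS at steady state by setting the
rate d[RS]/dt (Eq. 6) to zero:

$$\frac{d[RS]}{dt} = \mathrm{Rate}_{\mathrm{Inactive,Binding}}[R][S] - \mathrm{Rate}_{\mathrm{Inactive,Release}}[RS] \tag{6}$$

$$0 = \mathrm{Rate}_{\mathrm{Inactive,Binding}}[R][S] - \mathrm{Rate}_{\mathrm{Inactive,Release}}[RS] \quad \text{(at steady state)} \tag{7}$$

$$0 = ([R]_{\mathrm{total}} - [RS])([S]_{\mathrm{total}} - [RS]) - K_d[RS] \tag{8}$$

$$0 = [RS]^2 + [R]_{\mathrm{total}}[S]_{\mathrm{total}} - [RS][R]_{\mathrm{total}} - [RS][S]_{\mathrm{total}} - K_d[RS] \tag{9}$$

$$0 = [RS]^2 - ([R]_{\mathrm{total}} + [S]_{\mathrm{total}} + K_d)[RS] + [R]_{\mathrm{total}}[S]_{\mathrm{total}} \tag{10}$$

$$[RS] = \frac{([R]_{\mathrm{total}} + [S]_{\mathrm{total}} + K_d) - \sqrt{([R]_{\mathrm{total}} + [S]_{\mathrm{total}} + K_d)^2 - 4[R]_{\mathrm{total}}[S]_{\mathrm{total}}}}{2} \tag{11}$$

$$[RS] = \frac{b - \sqrt{b^2 - 4[R]_{\mathrm{total}}[S]_{\mathrm{total}}}}{2}, \quad b = [R]_{\mathrm{total}} + [S]_{\mathrm{total}} + K_d \tag{11'}$$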


8. Now that rate equations for each reaction species and initial 
conditions are defined, reaction kinetics can be computed and 
the ODEs solved algorithmically. We have customarily used the 
MATLAB ODE solver function, ode45, for this; however, simi- 
lar algorithmic solvers of ODEs can be found in many other 
programming languages (see Note 3). 
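
The steps above can be sketched in a few dozen lines. The chapter's own implementation uses MATLAB and ode45; the block below is only an illustrative Python sketch that uses the standard library and a hand-written fixed-step Runge-Kutta (RK4) integrator in place of ode45. The association rate K_ON, the activation rate of 1 per second during illumination, the step size, and all function names are assumptions made for this example; the K_d values and reversal rate follow Note 4. The input schedule is integrated piecewise, one constant-γ interval at a time, as suggested in Note 3.

```python
import math

# Illustrative values only (assumptions for this sketch): uM and 1/s.
KD_INACTIVE = 4.7    # dark-state Kd
KD_ACTIVE = 0.13     # lit-state Kd
K_ON = 1.0           # assumed association rate, 1/(uM*s), same for R and R*
K_OFF_INACTIVE = KD_INACTIVE * K_ON   # off-rate = Kd * on-rate
K_OFF_ACTIVE = KD_ACTIVE * K_ON
K_REV = 0.02         # Rate_Rev, reversion of R* to R


def basal_rs(r_total, s_total, kd):
    """Smaller root of the quadratic in Eq. 10 (the physical steady state)."""
    b = r_total + s_total + kd
    return (b - math.sqrt(b * b - 4.0 * r_total * s_total)) / 2.0


def derivatives(state, gamma):
    """Right-hand sides of Eqs. 1-5 for state = (R, R*, RS, R*S, S)."""
    r, ra, rs, ras, s = state
    bind_i = K_ON * r * s
    bind_a = K_ON * ra * s
    return (
        K_OFF_INACTIVE * rs + K_REV * ra - bind_i - gamma * r,       # d[R]/dt
        gamma * r + K_OFF_ACTIVE * ras - bind_a - K_REV * ra,        # d[R*]/dt
        bind_i + K_REV * ras - K_OFF_INACTIVE * rs - gamma * rs,     # d[RS]/dt
        gamma * rs + bind_a - K_OFF_ACTIVE * ras - K_REV * ras,      # d[R*S]/dt
        K_OFF_INACTIVE * rs + K_OFF_ACTIVE * ras - bind_i - bind_a,  # d[S]/dt
    )


def shifted(state, slope, h):
    """Return state + h * slope, element by element."""
    return tuple(x + h * k for x, k in zip(state, slope))


def rk4_segment(state, gamma, duration, dt=0.01):
    """Fixed-step RK4 over one interval in which gamma is constant."""
    trace = [state]
    for _ in range(int(round(duration / dt))):
        k1 = derivatives(state, gamma)
        k2 = derivatives(shifted(state, k1, dt / 2.0), gamma)
        k3 = derivatives(shifted(state, k2, dt / 2.0), gamma)
        k4 = derivatives(shifted(state, k3, dt), gamma)
        state = tuple(
            x + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for x, a, b, c, d in zip(state, k1, k2, k3, k4)
        )
        trace.append(state)
    return trace


def simulate(r_total, s_total, schedule):
    """schedule is a list of (gamma, duration) pairs, integrated piecewise."""
    rs0 = basal_rs(r_total, s_total, KD_INACTIVE)
    trace = [(r_total - rs0, 0.0, rs0, 0.0, s_total - rs0)]
    for gamma, duration in schedule:
        trace.extend(rk4_segment(trace[-1], gamma, duration)[1:])
    return trace


# 30 s of light (gamma = 1 per second, assumed) followed by 120 s of darkness.
trace = simulate(0.1, 0.5, [(1.0, 30.0), (0.0, 120.0)])
recruited = [rs + ras for (_, _, rs, ras, _) in trace]
print("basal: %.4f uM, peak: %.4f uM" % (recruited[0], max(recruited)))
```

A quick sanity check on any such implementation is that [R] + [R*] + [RS] + [R*S] and [S] + [RS] + [R*S] stay constant over the whole run, and that all five derivatives vanish at the initial state when γ is zero.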


3.2 Analyzing Efficiency of Recruitment

This approach can be customized to simulate kinetic dynamics of a
variety of synthetic recruitment strategies by using different choices
of component concentrations, dissociation constants, and
activation/inactivation rates of the receptor (see Note 4). Here, we will
briefly display example model performance measures using values 
determined for iLID-mediated recruitment. As a general note, we 
suggest simulating the model for a few choices of concentrations 
first to verify that the output looks reasonable and to gain an 
intuition for the model: 


1. The computed system of ODEs can be evaluated by analyzing 
notable features of model outputs. For example, basal recruit- 
ment, maximum recruitment, fold recruitment, and dissocia- 
tion each designate performance measures that help guide 
optimization strategies for efficient synthetic recruitment 
(Fig. 3). Basal recruitment can be computed from the initial
steady state of the model, while other measures can be
determined from the simulated model output (Fig. 4a). Max
absolute recruitment can be determined from the steady-state
solution during extended system input, similar to how initial
conditions were calculated in the previous section (Fig. 4d).
Fold recruitment can be calculated using computed values for
basal and max recruitment (Fig. 4g). Kinetic parameters such as
t_1/2 of dissociation are computed from temporal profiles for
simulations involving transient system input (Fig. 4j). After
analyzing the model across systematic ranges of concentrations
for each component, heat map plots can be used to visualize
how each performance measure depends on component concentrations
(Fig. 4b, e, h, k). Each pixel in the heat map
represents outputs from model simulations using specific combinations
of parameters (Fig. 4c, f, i, l; top) (see Note 5). Lastly,
analogous plots can also be generated to visualize how performance
depends on additional variables such as rate constants
and component binding affinity characteristics.

Fig. 3 Representative plot of recruitment kinetics after receptor activation.
Depiction of an example kinetic profile of recruitment after temporary receptor
stimulation. Illustrated here are measures of recruitment dynamics captured in
ODE/PDE models including basal recruitment, max recruitment, fold enrichment,
and t_1/2 of dissociation. Plot labels: a = basal recruitment; b = max recruitment;
(a+b)/a = fold enrichment; c = t_1/2 for dissociation. Axes: membrane-recruited [S]
([RS] + [R*S]) versus time


2. Model output interpretation: Performance measures generated
from model simulations can predict how recruitment parameters and
component characteristics influence recruitment efficiency. For
example, our system of ODEs generally predicts that both basal and
maximum membrane recruitment scale positively with increasing
component concentrations (Fig. 4c, f, bottom). While this trade-off
limits system performance, the simulation results can be used to
understand how fold recruitment scales with component concentrations
(Fig. 4i, bottom) and identify ranges of component
concentrations where system performance is most efficient. By
defining threshold values for parameters of interest, an ideal
concentration space can be determined where system performance
exceeds each threshold. These results can be used to
guide experimental design and troubleshoot parameters where
synthetic recruitment is not performing as desired. Furthermore,
the model can make less intuitive predictions. For example,
our iLID ODE model predicted that global iLID-SspB
dissociation rates decrease with increasing iLID concentration
[7] (Fig. 4l, bottom). This effect arises from newly dissociated
SspB molecules being more likely to rebind at the
membrane if surrounding levels of unbound iLID are high.


3.3 Modeling Spatial Dynamics of Recruitment
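
Before moving on to the spatial model, the concentration sweep described for the ODE model in Subheading 3.2 can be sketched briefly. This is a simplified Python illustration, not the authors' MATLAB analysis: basal recruitment comes from the quadratic in Eq. 10 with the dark-state K_d, while maximum recruitment is approximated by the same quadratic with the lit-state K_d, which assumes saturating light so that every receptor is active. Fold recruitment is taken here as maximum divided by basal. The grid, the thresholds, and the function names are all assumptions chosen for the example.

```python
import math

KD_INACTIVE = 4.7   # uM, dark state (Note 4)
KD_ACTIVE = 0.13    # uM, lit state (Note 4)


def bound_complex(r_total, s_total, kd):
    """Steady-state complex concentration from the quadratic in Eq. 10."""
    b = r_total + s_total + kd
    return (b - math.sqrt(b * b - 4.0 * r_total * s_total)) / 2.0


def performance(r_total, s_total):
    """Return (basal, maximum, fold) recruitment for one concentration pair."""
    basal = bound_complex(r_total, s_total, KD_INACTIVE)
    # Assumption: saturating light, so every receptor is treated as active.
    maximum = bound_complex(r_total, s_total, KD_ACTIVE)
    return basal, maximum, maximum / basal


# Half-log grid from 0.01 to 10 uM for both components.
levels = [0.01 * 10.0 ** (i / 2.0) for i in range(7)]
heat_map = [[performance(r, s) for s in levels] for r in levels]


def acceptable(min_fold, max_basal):
    """Grid points (as index pairs) whose measures pass both thresholds."""
    return [
        (i, j)
        for i, row in enumerate(heat_map)
        for j, (basal, _, fold) in enumerate(row)
        if fold >= min_fold and basal <= max_basal
    ]


for i, j in acceptable(min_fold=10.0, max_basal=0.01):
    basal, maximum, fold = heat_map[i][j]
    print("R=%.3g uM, S=%.3g uM: basal=%.4f, max=%.4f, fold=%.1f"
          % (levels[i], levels[j], basal, maximum, fold))
```

Each entry of heat_map corresponds to one pixel of a heat map like those described above. In this simplified picture, fold recruitment approaches the ratio of the two K_d values at low concentrations and falls as either component becomes abundant, which is the trade-off the thresholds are meant to manage.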


Models generated from ODEs typically capture dynamics across a 
single dimension and are therefore suitable for determining the 
temporal evolution of reaction systems. However, in addition to 
kinetic features, cell signaling mechanisms often rely on spatially 
heterogeneous patterns. For protein interactions at the plasma 
membrane, cytoplasmic diffusion near the cell cortex and lateral 
diffusion along the membrane are important factors that influence 
spatial distributions of recruitment events. Additionally, functional 
outputs of biological signaling are often determined by spatially 
asymmetric propagation of signaling circuitry. Partial differential 
equations (PDEs) follow system dynamics across multiple indepen- 
dent variables and are useful for capturing how system components 
change in space and time. Here, we will generate a PDE model that 
illustrates how diffusion affects the spatial distribution of recruit- 
ment over time. Toward this goal, consider the general diffusion 
equation for temporal change in concentration of a chemical spe- 
cies over a one-dimensional spatial coordinate: 


$$\frac{\partial u}{\partial t} = D\frac{\partial^2 u}{\partial x^2} \tag{12}$$

where u(x, t) represents a concentration value of species u at position
x and time t. Additionally, ∂u/∂t represents the change of concentration
of u over time, ∂²u/∂x² describes the profile of concentration
across the spatial coordinate x, and D is the diffusion coefficient
within the system.
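
To build intuition for Eq. 12 before adding reactions, the equation can be stepped forward on a grid with a simple explicit finite-difference scheme. The sketch below is an illustration in Python, not the solver used in this chapter (MATLAB's pdepe). The grid spacing, time step, Gaussian starting profile, and the diffusion coefficient of 1 µm²/s are assumptions chosen for the example. The ends of the domain mirror their neighbouring value, which gives zero flux at the boundaries.

```python
import math


def diffuse(u, d_coeff, dx, dt, steps):
    """Explicit finite-difference update of Eq. 12 with zero-flux boundaries."""
    lam = d_coeff * dt / (dx * dx)
    if lam > 0.5:
        raise ValueError("time step too large for the explicit scheme")
    u = list(u)
    last = len(u) - 1
    for _ in range(steps):
        new = []
        for i, value in enumerate(u):
            left = u[i - 1] if i > 0 else value       # mirrored neighbour: no flux
            right = u[i + 1] if i < last else value
            new.append(value + lam * (left - 2.0 * value + right))
        u = new
    return u


DX = 0.1                                         # um, grid spacing
centers = [(i + 0.5) * DX for i in range(100)]   # 10 um domain
initial = [math.exp(-((x - 5.0) ** 2) / (2.0 * 0.5 ** 2)) for x in centers]
final = diffuse(initial, d_coeff=1.0, dx=DX, dt=0.004, steps=250)  # 1 s in total

print("mass before/after:", sum(initial) * DX, sum(final) * DX)
print("peak before/after:", max(initial), max(final))
```

The total amount of material is unchanged while the peak drops and the profile widens. The explicit scheme is only stable when D·dt/dx² is at most 0.5, which is why the function rejects larger time steps.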

We can build our PDE model based on our previous ODE 
model through the following steps: 


1. Determine the spatial domain for the model. Using a 
one-dimensional domain simplifies computation and the 
interpretation of model results. We can modify symmetry to 
model higher-dimensional geometries using this simple 
domain. For analyzing the spatial spread of proteins on a 
two-dimensional plasma membrane, we define our spatial coor- 
dinate to be the radial distance along the membrane from the 
target recruitment site. We use symmetry conditions in the
model to handle the increasing area associated with larger
radial distances. This is accomplished in MATLAB’s pdepe
PDE solver by setting the symmetry parameter m to 1 for
“cylindrical” symmetry (see Note 6).


2. Define the diffusion coefficients. The diffusion coefficients are
the only additional parameters for this model (see Note 7).


3. Define the molecular species within the variable u.
Importantly, u(x, t) can represent a single concentration species
across the spatial coordinate x and time coordinate t or, for
signaling circuits that involve multiple reaction species, a matrix
that incorporates all relevant species within the system:

$$u(x,t) = \begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \\ u_5 \end{bmatrix} \tag{13}$$

where u1 = [R], u2 = [R*], u3 = [RS], u4 = [R*S], u5 = [S].

For our purposes, we interpret the units for the spatial
dimension to be in microns since this is a relevant scale for
a cell.


4. To build a PDE, spatial boundary conditions must also be
defined. We typically use the Neumann condition (see Note
8), specifying that the spatial derivative is zero at the
boundaries:

$$\frac{\partial u}{\partial x} = 0 \quad \text{at } x = 0 \text{ and } x = x_{\max}$$


5. In contrast to the ODE example, we will now assume that this
system begins with a pre-established profile of active receptors.
This approach is useful for analyzing the spread of recruited
components after an initial standardized input. Therefore,
initial conditions can be adapted as follows:

$$u_1 = ([R]_{\mathrm{total}} - BB) - u_2;$$
$$u_2 = ([R]_{\mathrm{total}} - BB) \cdot r(x);$$
$$u_3 = BB - u_4;$$
$$u_4 = BB \cdot r(x);$$
$$u_5 = [S]_{\mathrm{total}} - u_3 - u_4;$$

$$\text{where } BB = \frac{b - \sqrt{b^2 - 4[R]_{\mathrm{total}}[S]_{\mathrm{total}}}}{2}, \quad b = [R]_{\mathrm{total}} + [S]_{\mathrm{total}} + K_d$$

where r(x) is a function that determines the spatial distribution
of receptor activation and BB represents the basal bound state
of receptor. For example, in optogenetic systems such as iLID,
after global light activation, r(x) can be set to a Gaussian profile
peaking at x = 0, with a width determined by the resolution of
focal stimulation for an optical microscope system.


6. For algorithmic evaluation of PDE models, programmatic
solvers often require representing PDEs in standard organizational
forms. In MATLAB, a standard form for 1D PDE
solvers is:

$$c\left(x,t,u,\frac{\partial u}{\partial x}\right)\frac{\partial u}{\partial t} = x^{-m}\frac{\partial}{\partial x}\left(x^{m} f\left(x,t,u,\frac{\partial u}{\partial x}\right)\right) + s\left(x,t,u,\frac{\partial u}{\partial x}\right) \tag{14}$$

where f(x, t, u, ∂u/∂x) is a term for the spatial flux of the species u,
s(x, t, u, ∂u/∂x) is a source term or reaction term that, in this case,
will incorporate binding and chemical reactions that generate
or deplete species within u, and c(x, t, u, ∂u/∂x) represents a
balance coefficient. m is the symmetry constant that determines
the type of spatial symmetry in the system; m = 0, 1, or
2 represents Cartesian (no symmetry), cylindrical symmetry
(azimuthal), or spherical symmetry (azimuthal and zenith)
coordinates, respectively.


7. Referring to the initial equation for diffusion (Eq. 12), with an
addition of the reaction term, its standard form can be
rewritten as:

$$\frac{\partial u}{\partial t} = \frac{\partial}{\partial x}\left(D\frac{\partial u}{\partial x}\right) + s\left(u(x,t),\frac{\partial u}{\partial x}\right) \tag{15}$$

where, by comparison with Eq. 14, c = 1 and f = D ∂u/∂x.

In this system, the source term s corresponds to the same
set of terms as the right-hand side of the equations from our
ODEs generated previously (Eqs. 1, 2, 3, 4, and 5).
Collectively, these equations can be written in matrix form as
s(u(x, t), ∂u/∂x) (see Note 9):

$$\begin{bmatrix} s_1 \\ s_2 \\ s_3 \\ s_4 \\ s_5 \end{bmatrix} = \begin{bmatrix}
\mathrm{Rate}_{\mathrm{Inactive,Release}}\,u_3 + \mathrm{Rate}_{\mathrm{Rev}}\,u_2 - \mathrm{Rate}_{\mathrm{Inactive,Binding}}\,u_1 u_5 \\
\mathrm{Rate}_{\mathrm{Active,Release}}\,u_4 - \mathrm{Rate}_{\mathrm{Active,Binding}}\,u_2 u_5 - \mathrm{Rate}_{\mathrm{Rev}}\,u_2 \\
\mathrm{Rate}_{\mathrm{Inactive,Binding}}\,u_1 u_5 + \mathrm{Rate}_{\mathrm{Rev}}\,u_4 - \mathrm{Rate}_{\mathrm{Inactive,Release}}\,u_3 \\
\mathrm{Rate}_{\mathrm{Active,Binding}}\,u_2 u_5 - \mathrm{Rate}_{\mathrm{Active,Release}}\,u_4 - \mathrm{Rate}_{\mathrm{Rev}}\,u_4 \\
\mathrm{Rate}_{\mathrm{Inactive,Release}}\,u_3 + \mathrm{Rate}_{\mathrm{Active,Release}}\,u_4 - \mathrm{Rate}_{\mathrm{Inactive,Binding}}\,u_1 u_5 - \mathrm{Rate}_{\mathrm{Active,Binding}}\,u_2 u_5
\end{bmatrix}$$


8. At this stage, with the initial and boundary conditions set, the
PDEs can be integrated using programmatic solvers such as the
pdepe function in MATLAB (see Note 10).


3.4 Analyzing Recruitment and Diffusional Spread


1. Just as with the ODE model, measures of system performance
should be computed. While basal recruitment can be computed
similarly, maximal recruitment will be different since PDE 
models simulate recruitment in a local subregion of the plasma 
membrane that evolves over time (Fig. 5a). In this case, both 
dissociation and diffusional spread can reduce the local accu- 
mulation of the recruited protein. For this reason, maximal 
recruitment should be determined by following simulation 
outputs over time. Additionally, the dependence of recruit- 
ment on the concentrations of R and S will likely scale differ- 
ently from what was previously produced in the ODE model. 


2. Compared to ODE simulations, the PDE model naturally presents
additional measures of system performance. Most notably,
the spread of recruitment regions can be calculated from
the spatial recruitment profiles at given component concentra- 
tions and diffusion characteristics (Fig. 5a). 


3. As with the ODE model, thresholds for the PDE system can be
set for each measure of system performance. Regions in component
concentration spaces that exhibit acceptable recruitment
performances can be identified and subsequently used
to inform how synthetic systems can be optimized at the bench 
(Fig. 5b, d, f). For example, we have used similarly structured 
PDE models to optimize iLID recruitment approaches. These 
PDE models profiled how customizing plasma membrane 
anchoring strategies that confer differential membrane diffu- 
sion properties to iLID receptors influence spatiotemporal 
SspB recruitment (see Notes 11 and 12). PDE modeling of 
iLID recruitment offered interesting predictions. Constraining 
receptor diffusion at the membrane by increasing membrane 
anchor size resulted in significant changes in substrate recruit- 
ment levels. This model predicted that decreasing receptor 
diffusion promoted increased maximum recruitment, fold 
recruitment, and lengthened evolution time to maximal
recruitment of substrate across wide ranges of receptor and
substrate concentrations (Fig. 5c, e, g).

Fig. 5 PDE modeling captures the effects of receptor diffusion on spatial spread and recruitment dynamics
across broad ranges of component expression levels. (a) Time-lapse plots displaying spatial spread of
recruitment predicted by PDE modeling. Additionally, important measures of recruitment are depicted
including max recruitment, basal recruitment, time to max recruitment, and recruitment spread. (b, d, f)
Heat map plots generated from PDE models displaying the effect of receptor diffusion on individual features of
recruitment across ranges of [R] and [S] concentrations. In these examples, values represent real recruitment
measures predicted by PDE simulations of iLID-SspB interactions with two different membrane anchors (see
Note 11). Red bars designate isolated concentration regions portrayed in c, e, and g. (c, e, g) Line traces of
heat map insets (red bars) from b, d, and f comparing the effect of changing receptor diffusion on individual
recruitment features across a range of receptor concentrations

Altogether, two-component ODE and PDE models reliably
capture fundamental features of protein recruitment dynamics.
With proper reaction constants and measures for component features
in hand, modeling approaches like these can be implemented
in a straightforward manner. Additionally, these analytical
approaches offer powerful predictive strength for synthetic recruitment
strategies and can provide unique insights into efficient
manipulation of compartmentalized signaling using synthetic tools.


4 Notes


1. To estimate the kinetic binding and dissociation rates, one can 
make some simplifying assumptions. In many cases, binding 
affinities are determined largely by the dissociation rates. For 
simplicity, we can assume that the association rates are equal for 
inactive and active forms of R. We then can estimate or cali- 
brate the association and dissociation rates from kinetic experi- 
ments measuring the half-time for association after a strong 
light stimulus. Importantly, association rates are likely to be 
different for different optogenetic systems. For example, the 
“magnets” system was designed to have a more rapid associa- 
tion rate [21, 22]. 

2. It is essential that as differential equations are built, balance is 
maintained according to the law of conservation of mass. This 
can be checked by making sure that for each equation, all events 
that either produce or consume the target component species 
are represented. 


3. In many cases, γ_input will be a step function. These are typically
not handled well by numerical integration algorithms such as
ode45. A handy solution to this problem is to perform piecewise
numerical integration. One can separately perform numerical
integration for each time period in which γ_input is constant,
using the output of each round of numerical integration as
the initial condition for the next. For example, a simple experiment
where γ_input is 1 for the first round and then 0 thereafter
would require two separate rounds of numerical integration
with the second using the conditions produced by the first
round.


4. Accurate estimations for component concentrations, dissociation
constants, and reaction rates can improve predictive ability
of ODE/PDE models. For modeling iLID recruitment, we use
an assortment of values either derived empirically or approximated
using measurements from similar mechanisms. The
following parameter values have been useful for modeling iLID
dynamics (with associated references):

Total [iLID] = 0.1 µM
Total [SspB] = 0.5 µM
K_d,lit (active) = 130 nM [6]
K_d,dark (inactive) = 4.7 µM [6]
Rate_Reversal = 0.02 s⁻¹ [21]
Rate_Dissociation = 0.5 s⁻¹ [21]


5. Model outputs may evolve over time; therefore, it is important
to verify that simulations are run over long enough time periods
to determine the correct value.


6. While cylindrical symmetry is useful for simulating spot recruitment
and diffusion along a flat membrane interface, in other
cases, it may be useful to model diffusion of a cytoplasmic
component toward or away from the membrane in a spherical
cell. For the latter, the symmetry parameter m, in MATLAB’s
pdepe PDE solver, can be set to 2 for designating spherical
(azimuthal and zenith) symmetry coordinates.


7. Diffusion coefficients can be determined empirically, for
instance, through fluorescence recovery after photobleaching
experiments. As a rough guide, diffusion coefficients may be
around 10-30 µm²/s for cytoplasmic proteins, 0.5-1 µm²/s for
lipid-anchored proteins, and 0.03-0.1 µm²/s for transmembrane
proteins.


8. The Neumann boundary condition specifies the value of the spatial
derivative at the boundaries of the system. By setting
the derivative to zero at each boundary, the resulting condition
can be thought of as a “reflecting” boundary which keeps
model components within the spatial limits of the
system. Therefore, under this condition, there is no passage of
molecular species in or out of the system through the boundary,
which helps ensure conservation of mass.


9. Note that the source term s for the PDEs encompasses kinetic
parameters and interaction states of component species. To also
incorporate light input, s(x, t, u, ∂u/∂x) can include a γ_input term
that designates a temporal profile of blue light activation such
as in Eqs. 1, 2, 3, 4, and 5.

10. Note that while PDE solvers in other programming languages
may have similar requirements for initial conditions, boundary
conditions, source, and flux terms, they may require different
organizational formats for proper implementation.
11. For this example, we implemented a PDE model of iLID
diffusion where membrane diffusion coefficients for iLID-CAAX
(short anchor) and Stargazin-iLID (long multipass
anchor) were estimated to be 1 µm²/s and 0.1 µm²/s, respectively,
based on observations from previous studies [23, 24].


12. Custom MATLAB code for implementing both ODE and
PDE models designed for iLID recruitment can be found at
https://github.com/srcollins/Code_for_iLID_Recruitment_from_Springer-Protocol-Chapter-2022.
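
For readers without MATLAB, the reaction-diffusion model of Subheading 3.3 can also be approximated with a short hand-written scheme. The block below is an illustrative Python sketch and not a port of the authors' code: it replaces pdepe with an explicit finite-volume update on radial annuli (the counterpart of m = 1), uses zero-flux boundaries, and starts from the initial profile of step 5 with a Gaussian r(x). The association rate, the grid, the time step, the 1 µm activation width, the substrate diffusion coefficient of 10 µm²/s, and all function names are assumptions made for the example. The two receptor diffusion coefficients follow Note 11.

```python
import math

K_ON = 1.0                           # assumed association rate, 1/(uM*s)
KD_INACTIVE, KD_ACTIVE = 4.7, 0.13   # uM (Note 4)
K_OFF_I, K_OFF_A = K_ON * KD_INACTIVE, K_ON * KD_ACTIVE
K_REV = 0.02                         # 1/s
R_TOTAL, S_TOTAL = 0.1, 0.5          # uM
N_CELLS, DR = 40, 0.25               # 10 um radial domain
EDGES = [i * DR for i in range(N_CELLS + 1)]
AREAS = [EDGES[i + 1] ** 2 - EDGES[i] ** 2 for i in range(N_CELLS)]  # annulus area / pi


def source(u):
    """Reaction terms s1..s5 for u = (R, R*, RS, R*S, S)."""
    r, ra, rs, ras, s = u
    bind_i, bind_a = K_ON * r * s, K_ON * ra * s
    return (K_OFF_I * rs + K_REV * ra - bind_i,
            K_OFF_A * ras - bind_a - K_REV * ra,
            bind_i + K_REV * ras - K_OFF_I * rs,
            bind_a - K_OFF_A * ras - K_REV * ras,
            K_OFF_I * rs + K_OFF_A * ras - bind_i - bind_a)


def initial_profile(sigma=1.0):
    """Initial conditions of step 5 with a Gaussian activation profile r(x)."""
    b = R_TOTAL + S_TOTAL + KD_INACTIVE
    bb = (b - math.sqrt(b * b - 4.0 * R_TOTAL * S_TOTAL)) / 2.0
    cells = []
    for i in range(N_CELLS):
        x = (i + 0.5) * DR
        act = math.exp(-x * x / (2.0 * sigma * sigma))
        u2 = (R_TOTAL - bb) * act
        u4 = bb * act
        u3 = bb - u4
        cells.append([(R_TOTAL - bb) - u2, u2, u3, u4, S_TOTAL - u3 - u4])
    return bb, cells


def step(cells, diff, dt):
    """One explicit Euler step: radial diffusion plus reactions in each annulus."""
    new = []
    for i, u in enumerate(cells):
        s = source(u)
        row = []
        for k in range(5):
            # r * D * du/dr at the outer and inner edges; zero at both domain ends.
            out = 0.0 if i == N_CELLS - 1 else EDGES[i + 1] * diff[k] * (cells[i + 1][k] - u[k]) / DR
            inn = 0.0 if i == 0 else EDGES[i] * diff[k] * (u[k] - cells[i - 1][k]) / DR
            row.append(u[k] + dt * (2.0 * (out - inn) / AREAS[i] + s[k]))
        new.append(row)
    return new


def max_center_recruitment(d_receptor, d_substrate=10.0, t_end=2.0, dt=0.002):
    """Largest [RS] + [R*S] reached in the innermost annulus during the run."""
    bb, cells = initial_profile()
    diff = [d_receptor] * 4 + [d_substrate]
    best = cells[0][2] + cells[0][3]
    for _ in range(int(round(t_end / dt))):
        cells = step(cells, diff, dt)
        best = max(best, cells[0][2] + cells[0][3])
    return bb, best, cells


bb, max_caax, cells_caax = max_center_recruitment(1.0)   # short anchor, 1 um^2/s
_, max_stargazin, _ = max_center_recruitment(0.1)        # multipass anchor, 0.1 um^2/s
print("basal %.4f, max (D=1) %.4f, max (D=0.1) %.4f" % (bb, max_caax, max_stargazin))
```

The explicit update is only stable while 2·D·dt/dr² stays below 1 for the fastest species, so the time step must shrink if the grid is refined or a larger diffusion coefficient is used. An adaptive solver such as pdepe avoids this restriction, which is one reason to prefer it for production runs.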
Genome-Scale Modeling and Systems Metabolic
Engineering of Vibrio natriegens for the Production 
of 1,3-Propanediol 


Ye Zhang, Dehua Liu, and Zhen Chen 


Abstract 


The fastest-growing bacterium Vibrio natriegens is a highly promising next-generation workhorse for 
molecular biology and industrial biotechnology. In this work, we describe the workflows for developing
genome-scale metabolic models and genome-editing protocols for engineering Vibrio natriegens. A case
study for metabolic engineering of Vibrio natriegens for the production of 1,3-propanediol is also
presented.


Key words Vibrio natriegens, Systems metabolic engineering, 1,3-Propanediol 


1 Introduction 


Vibrio natriegens is a moderately halophilic, gram-negative, and non- 
pathogenic microorganism isolated from salt marshes [1]. It is the 
fastest-growing microorganism identified thus far, with a minimal gen- 
eration time between 7 and 10 min [2]. More importantly, Vibrio 
natriegens has an exceptionally high glucose uptake rate in minimal 
media under both aerobic (~3.90 ± 0.08 g/g/h) and anaerobic conditions
(~7.81 ± 0.08 g/g/h) [3]. The inherent properties of Vibrio
natriegens make it a promising platform for next-generation biotech- 
nology [4, 5]. Recently, applications of Vibrio natriegens in molecular 
biology for vector construction, protein expression, and cell-free syn- 
thesis have been widely explored and well established [1, 6-10]. Appli- 
cation of Vibrio natriegens for the production of chemicals, such as 
2,3-butanediol, 1,3-propanediol, L-alanine, and melanin, was also 
demonstrated, highlighting its potential for industrial biotechnology 
[3-5, 11]. The development of systems biology tools and genome
editing technologies also significantly accelerated the exploration and 
development of V. natriegens as a new industrial chassis [5, 12-14]. 


Kumar Selvarajoo (ed.), Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology, 
Methods in Molecular Biology, vol. 2553, https://doi.org/10.1007/978-1-0716-2617-7_11, 
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023 
The construction and application of genome-scale metabolic
models (GEMs) of industrial chassis are important for systematically
investigating and predicting their metabolic characteristics and
physiological properties and are widely used in computational biology,
systems metabolic engineering, and synthetic biology applications
[15, 16]. To the best of our knowledge, the reconstruction of
a GEM of V. natriegens has not been reported to date.

The development of efficient genome-editing tools is essential
for constructing microbial cell factories. Vibrio species can actively
take up exogenous DNA from the environment and integrate it
into their genome via natural transformation [12, 17]. By combining
natural transformation with the expression of the competence
regulator TfoX and the FLP/FRT recombination system, we have
developed an efficient approach for multiplex genome editing of
V. natriegens [5].

Here, we will present the workflow for generating a high-quality
GEM of V. natriegens and a genome-editing protocol for
engineering V. natriegens. A case study of the systematic metabolic
engineering of V. natriegens for the production of 1,3-propanediol
is also demonstrated.


2 Materials


2.1 Data Sources and Software

The methodologies and protocols for generating the GEMs of
microorganisms have been published previously [18, 19]. Data
sources and software used for the reconstruction of GEMs are
listed in Table 1. We will present the workflow for generating a
high-quality GEM of V. natriegens based on AutoKEGGRec, which
is an efficient tool for the generation of draft models [20].
AutoKEGGRec is a user-friendly tool to create draft models based on the
MATLAB platform and KEGG database. It is compatible with the
COBRA toolbox and convenient for the conversion of model files
to SBML [20]. The reconstruction of the GEM of V. natriegens
consists of five stages.


2.2 Strain and Media Recipes


1. Vibrio natriegens ATCC14048.

2. IPTG.

3. Rhamnose.

4. LBv2 medium: LB broth supplemented with v2 salts (200 mM
NaCl, 23.14 mM MgCl2, and 4.2 mM KCl).

5. BHIv2 medium: 37 g/L BHI and v2 salts.

6. Electroporation buffer: 232.8 g/L sucrose and 1.22 g/L
K2HPO4 (pH 7.0).

7. Ocean salt medium: ocean salt 28 g/L.

8. Antibiotics: kanamycin (100 µg/mL), spectinomycin (100 µg/mL),
ampicillin (100 µg/mL), chloramphenicol (5 µg/mL).

9. Fermentation medium: KH2PO4 1.0 g/L, yeast extract 5 g/L,
(NH4)2SO4 5 g/L, NaCl 15 g/L, glycerol 50 g/L,
MgSO4·7H2O 1 g/L, CoCl2·6H2O 0.01 g/L, MnCl2·4H2O
0.01 g/L, FeSO4·7H2O 0.01 g/L, vitamin B12 0.005 g/L.

10. Fermentation feeding medium: 600 g/L glycerol and 10 g/L
yeast extract.


Table 1
Data sources used for the reconstruction of GEMs

Name                                        Link                                                        References

Genome and genomic annotation databases
Annotation of MIcrobial Genes               http://www.genoscope.cns.fr/agc/tools/amigene/index.html    [34]
BAR                                         https://bar.biocomp.unibo.it/bar3/                          [35]
Bioconductor                                https://bioconductor.org/                                   [36]
Genomes OnLine Database                     https://gold.jgi.doe.gov/                                   [37]
KBase                                       https://www.kbase.us/develop/                               [21]
KEGG Automatic Annotation Server            https://www.genome.jp/kegg/kaas/                            [38]
NCBI Entrez Gene                            http://www.ncbi.nlm.nih.gov/sites/entrez                    [39]
RAST                                        https://rast.nmpdr.org/                                     [40]

Biochemical databases
BRENDA                                      https://www.brenda-enzymes.info/                            [41]
KEGG                                        https://www.genome.jp/kegg/                                 [42]
ModelSEED                                   https://modelseed.org/                                      [22]
pKa DB                                      http://www.acdlabs.com/products/phys_chem_lab/
PubChem                                     http://pubchem.ncbi.nlm.nih.gov/                            [43]
Transporter Classification Database (TCDB)  http://www.tcdb.org/                                        [44]
TransportDB                                 http://www.membranetransport.org/transportDB2/index.html    [45]
UniProt                                     https://www.UniProt.org/                                    [46]

Protein localization databases
BASys                                       http://basys.ca/                                            [47]
PSORT                                       https://www.psort.org/psortb/                               [48]

Reconstruction resources and software
AutoKEGGRec                                 https://github.com/emikar/AutoKEGGRec                       [20]
COBRA                                       https://opencobra.github.io/cobratoolbox/stable/            [28]
KBase                                       https://www.kbase.us/develop/                               [21]
MATLAB                                      https://www.mathworks.com/products/MATLAB.html
Merlin                                      https://merlin-sysbio.org/                                  [23]
MetaCyc                                     https://metacyc.org/                                        [24]
ModelSEED                                   https://modelseed.org/                                      [22]
OptFlux                                     http://www.optflux.org/
RAVEN                                       https://github.com/SysBioChalmers/RAVEN                     [25]


2.3 Plasmid and DNA Cassette


Plasmid pXMJ19-tfoX consists of an IPTG-inducible competence 
regulator gene tfoX allowing Vibrio cells to become competent and 
a rhamnose-inducible flp gene to remove selection markers with 
FRT sites [5, 12]. 

The recombinant DNA cassettes used for homologous recombination
and gene editing are assembled by overlap extension PCR. Each
cassette contains the upstream fragment (~3000 bp) of the target gene,
the selection marker gene flanked by two FRT loci, and the downstream
fragment (~3000 bp) of the target gene. The selection marker
genes can be resistance genes to kanamycin (Kan), spectinomycin
(Spec), ampicillin (Amp), etc.


3 Method 


3.1 Genome-Scale Modeling


This section aims to provide a detailed guide for the construction
of a genome-scale metabolic model (GEM) of V. natriegens (Fig. 1) [5].


3.1.1 Draft Reconstruction


The main task of this stage is to obtain genome annotation and
biochemical information, including candidate metabolites and
metabolic reactions, for the reconstruction of a draft metabolic
model. Since the reconstruction and application of GEMs rely mainly
on the biochemical data and metabolic reactions of the draft, the
quality and reliability of the genome annotation are critical to the
quality of the reconstruction. Therefore, it is important to acquire
the latest and most credible genome annotations. This stage can be
accomplished automatically with a number of software tools or web
services, including MetaDraft, MetaCyc, RAVEN, Merlin, ModelSEED,
KBase, AutoKEGGRec, etc. [20-27].

AutoKEGGRec is an algorithm designed to work with the COBRA
toolbox on the MATLAB platform. After proper configuration, it can
directly retrieve all metabolites, reactions, genes, annotations, and
gene-protein-reaction rules from the KEGG database for the target
microorganism with a single command:


outputStruct = AutoKEGGRec(KEGG organism IDs)


For V. natriegens, the corresponding KEGG organism ID is vna. Thus,
the following command can be used to get the draft model:


outputStruct = AutoKEGGRec('vna')
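
The gene-protein-reaction (GPR) rules retrieved with the draft are
Boolean expressions over gene identifiers, and they determine which
reactions are lost when genes are deleted. The following is a minimal
Python sketch of that idea only. It is not part of AutoKEGGRec or the
COBRA toolbox, which handle GPR rules internally; the function name and
the gene names (geneA, geneB, geneC) are hypothetical placeholders.

```python
import re

# After substitution, only these tokens may remain in the expression.
ALLOWED = re.compile(r"(?:\s|\(|\)|and|or|True|False)*")


def reaction_is_active(gpr_rule, deleted_genes):
    """Evaluate a Boolean GPR rule after removing the given genes."""
    if not gpr_rule.strip():
        # No gene association: treated here as always available (assumption).
        return True

    def gene_to_bool(match):
        token = match.group(0)
        if token in ("and", "or"):
            return token
        # A gene is "True" when it is still present in the genome.
        return str(token not in deleted_genes)

    expression = re.sub(r"[A-Za-z_][A-Za-z0-9_.]*", gene_to_bool, gpr_rule)
    if not ALLOWED.fullmatch(expression):
        raise ValueError("unsupported GPR rule: " + gpr_rule)
    return eval(expression, {"__builtins__": {}}, {})


# Hypothetical rule: an enzyme complex (geneA and geneB) or an isozyme (geneC).
rule = "(geneA and geneB) or geneC"
print(reaction_is_active(rule, {"geneA"}))           # isozyme still present
print(reaction_is_active(rule, {"geneA", "geneC"}))  # no route left
```

In this reading, "and" joins the subunits of a complex and "or" joins
isozymes, so a reaction is lost only when every alternative is disrupted.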


Stage 1 Draft reconstruction

1. Obtain genome information and genome annotation.
2. Obtain candidate metabolites and metabolic reactions.
3. Assemble draft reconstruction.

Stage 2 Refinement of reconstruction

4. Determine and verify substrate and cofactor usage.
5. Determine the charged formula.
6. Calculate reaction stoichiometry.
7. Determine reaction directionality.
8. Add information for gene and reaction localization.
9. Add subsystems information.
10. Verify gene-protein-reaction association.
11. Add metabolite identifier and related notes.
12. Repeat steps 4 to 11 for all genes.
13. Add spontaneous reactions to the reconstruction.
14. Add extracellular and periplasmic transport reactions.
15. Add intracellular transport reactions.
16. Add exchange reactions.
17. Determine and add biomass reaction.
18. Add ATP-maintenance reaction.
19. Add sink reactions.
20. Determine growth medium requirements.

Stage 3 Reconstruction of mathematical model

21. Configure the COBRA toolbox.
22. Load reconstruction into MATLAB.
23. Set objective function and suitable simulation constraints.

Stage 4 Network evaluation

24. Test if network is mass- and charge-balanced.
25. Identify metabolic dead-ends.
26. Identify and fix gaps.
27. Adjust the simulation constraints for specific conditions.
28. Test if biomass precursors can be produced in specific medium.
29. Compare predicted physiological properties with known properties.

Stage 5 Data assembly and dissemination

30. Print MATLAB model content.
31. Add gap information to the reconstruction output.
32. Simulation and analysis.


Fig. 1 Brief overview of iterative reconstruction of a genome-scale metabolic model. The general procedure is
referenced from [19]


3.1.2 Reconstruction Refinement


It should be noted that the draft obtained in the first stage might be
incomplete and contain many errors, including uncertain cofactor
preference, inaccurate reaction stoichiometry, and missing reactions
and genes. These issues need to be carefully corrected. In
addition, the localization of enzymes should be determined, since it
guides the addition of transport reactions and exchange reactions.
Metabolite identifiers, related references, and notes also need to be
added to improve the readability and compatibility of the model.
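
One way to catch inaccurate stoichiometry is to check that each
reaction is mass- and charge-balanced. The sketch below illustrates the
idea in plain Python under simplifying assumptions: formulas are flat
strings without brackets or hydrates, and the metabolite identifiers
(glyc, 3hpa, h2o) are hypothetical. It uses the dehydration of glycerol
to 3-hydroxypropionaldehyde as an example; it is not the COBRA toolbox
routine.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C3H8O3'."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def reaction_imbalance(stoichiometry, formulas, charges):
    """Return (element imbalance, net charge); ({}, 0) means balanced.

    stoichiometry maps metabolite -> coefficient
    (negative for substrates, positive for products).
    """
    elements = Counter()
    net_charge = 0
    for metabolite, coefficient in stoichiometry.items():
        for element, n in parse_formula(formulas[metabolite]).items():
            elements[element] += coefficient * n
        net_charge += coefficient * charges[metabolite]
    return {e: n for e, n in elements.items() if n != 0}, net_charge


formulas = {"glyc": "C3H8O3", "3hpa": "C3H6O2", "h2o": "H2O"}
charges = {"glyc": 0, "3hpa": 0, "h2o": 0}

# glycerol -> 3-hydroxypropionaldehyde + H2O
balanced = {"glyc": -1, "3hpa": 1, "h2o": 1}
# Same reaction with the water forgotten, as might happen in a draft.
missing_water = {"glyc": -1, "3hpa": 1}

print(reaction_imbalance(balanced, formulas, charges))
print(reaction_imbalance(missing_water, formulas, charges))
```

A non-empty imbalance points to the atoms that are unaccounted for,
which usually indicates a missing proton, water, or cofactor.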

Moreover, the biomass reaction, which plays an important role in
in silico simulation, should be estimated or determined [19, 26]. It
consists of all known biomass components and their fractional
contributions to the overall cellular biomass, including protein, RNA,
DNA, lipids, lipopolysaccharides, peptidoglycan, glycogen, polyamines,
etc. The growth-associated ATP maintenance (GAM) reaction and the
nongrowth-associated ATP maintenance (NGAM) reaction, which
account for the energy necessary for cell replication and for
maintaining the cell, respectively, should also be determined by
chemostat growth experiments or estimated from the available
literature [19].

Curation and refinement are important to reconstruct a high-quality
GEM. Detailed steps can be found in Fig. 1. This stage can be
supported by biochemical databases and software, but manual
evaluation remains indispensable. Databases including NCBI, SEED,
KEGG, BRENDA, TransportDB, and UniProt can be helpful. Since
AutoKEGGRec is compatible with the COBRA toolbox, any correction can
be added directly to the existing data [28].


3.1.3 Reconstruction of the Mathematical Model


In this stage, the refined biochemical information is converted into
a mathematical format. MATLAB supplemented with the SBML
toolbox, the COBRA toolbox, and an LP solver can automate this
process [28-30]. Moreover, the system boundaries and simulation
constraints are defined in this stage, which converts the GEM into a
condition-specific model. Owing to the increasing abundance of
biological and biochemical information, fine-tuned constraints
can be set so that the model reproduces the actual metabolism more
accurately and reliably.


3.1.4 Network Evaluation


This stage consists of model verification and evaluation. Although
reconstruction refinement is performed in stage 2, there could still
be omissions or errors in the metabolic model and the mathematical
model, including inappropriate constraints, missing transport or
exchange reactions, dead-end metabolites, network gaps, etc. It is
important to test in this stage whether biomass and biomass
precursors can be produced in specific media. This helps to analyze
the differences between the simulated results and the actual
metabolism and to further refine the GEM. Iterative manual refinement
is necessary to reconstruct a high-quality GEM. MATLAB and
COBRA are helpful for identifying and fixing these problems.


3.1.5 Data Assembly and Dissemination


Once iterative and precise refinement is achieved, GEMs can be
employed for in silico analysis. By defining appropriate constraints,
particular metabolic characteristics and metabolic flux distributions
can be obtained to investigate the properties of microorganisms and
to guide metabolic engineering.


3.2 Gene-Editing Protocol (Fig. 2)


3.2.1 Introduction of Plasmid pXMJ19-tfoX


The introduction of plasmid pXMJ19-tfoX into V. natriegens could 
be achieved by electrotransformation [1, 31]: 


1. Inoculate V. natriegens into 5 mL LBv2 medium and grow
overnight at 37 °C and 200 rpm.

2. Inoculate the overnight culture into 100 mL BHIv2 medium at a
dilution of 1:100, and grow it at 37 °C and 200 rpm to an
OD600 of 0.5.

3. Transfer the culture to precooled 50 mL tubes and incubate on
ice for 20-30 min.

4. Centrifuge the culture at 4 °C and 6500 rpm for 15 min.

5. Decant the supernatant and gently resuspend the cell pellets
in 5-10 mL electroporation buffer.

6. Add 20-30 mL electroporation buffer and centrifuge the cells
at 6750 rpm and 4 °C for 15 min.

7. Repeat the wash two or three times.

8. Gently resuspend the cell pellets in electroporation buffer to
a final OD600 of 16.

9. Divide the cells into chilled tubes.

10. Add plasmids to the cells and gently mix (2:100 v/v).

11. Transfer the mixture to a precooled 1 mm electroporation
cuvette and electroporate with the following parameters:
800 V, 25 µF, and 200 Ω.

12. After electroporation, immediately add 500 µL BHIv2 medium,
and culture the mixture at 37 °C and 200 rpm for 1-2 h
for recovery.

13. Plate out the culture on solid LBv2 plates containing
chloramphenicol, and incubate overnight at 37 °C for colony growth.


Fig. 2 Genome-editing sketch map of V. natriegens. The general procedure is referenced from [5]


3.2.2 Natural Transformation


1. Incubate the strains harboring pXMJ19-tfoX overnight in
LBv2 medium with 5 µg/mL chloramphenicol and 1 mM
IPTG at 30 °C and 200 rpm [32].

2. Dilute the overnight culture 100-fold with ocean salt medium
containing 1 mM IPTG.

3. Add 200 ng recombinant DNA fragment.

4. Incubate the mixture statically at 30 °C for 4-6 h for natural
transformation.

5. Add 1 mL of LBv2 medium and culture the cells at 30 °C and
200 rpm for recovery.

6. Plate out the recovery culture on solid LBv2 plates containing
appropriate antibiotics, and incubate overnight at 37 °C for
colony growth.


3.2.3 Elimination of the Selection Marker and Curing of Plasmid pXMJ19-tfoX


To eliminate the selection marker:

1. Culture the selected strain in 5 mL LBv2 medium with 1 mM
rhamnose at 37 °C and 200 rpm for 12 h.

2. Dilute the culture 100-fold with LBv2 medium, and spread it
on solid LBv2 plates containing only chloramphenicol to
maintain pXMJ19-tfoX.

3. Screen for strains without selection markers by testing the
antibiotic resistance of the colonies or by colony PCR.


To cure plasmid pXMJ19-tfoX:

1. Culture the strains in 5 mL LBv2 medium without antibiotics
at 37 °C and 200 rpm for 12 h.

2. Dilute the culture 10,000-fold with LBv2 medium, and spread
it on solid LBv2 plates without any antibiotics.

3. Screen for strains without the plasmid by testing the antibiotic
resistance of the colonies or by colony PCR.


3.3 Systems Metabolic Engineering of V. natriegens for the Production of 1,3-Propanediol


1,3-Propanediol (1,3-PDO) is a valuable chemical that is used as a 
solvent, an antifreeze, and a monomer for the synthesis of poly- 
ethers, polyurethanes, and polyesters. Importantly, 1,3-PDO can 
be used as a building block for the synthesis of a high-performance 
polyester, polytrimethylene terephthalate (PTT), which is widely 
used in carpets, automotive fabrics, furnishings, garments, and 
many other industries [5, 33]. 

Following the genome-scale modeling protocol described above, a GEM
of V. natriegens is generated with AutoKEGGRec [19, 20]. The general
composition of the model after iterative refinement is shown in
Table 2. The synthetic pathway of 1,3-PDO from glycerol is then
introduced into the refined GEM, and the model is imported into
OptFlux [5, 33]. By analyzing how the heterologous 1,3-PDO synthesis
pathway perturbs the metabolic flux distribution of V. natriegens, we
developed several systems metabolic engineering strategies to enhance
the production of 1,3-PDO:
1. Knockout of genes involved in byproduct formation.

2. Improvement of the intracellular reducing environment.

3. Balance of the 1,3-PDO synthesis module and glycerol-oxidative
pathway.

4. Optimization of the cultivation process.


Table 2
Composition of the refined GEM of V. natriegens

Category     Item                 Number
Gene         Gene                 1183
             Gene rules           1613
Metabolite   Metabolite           1299
             Compartments         3
Reaction     Reaction             1527
             Metabolism           1265
             Transport reaction   179
             Exchange reaction    83


According to these strategies, the following gene modifications
and process optimization are carried out:

1. Deletion of the adhE, ldhA, pta-ackA, pfl, and aldAB genes to
block metabolic fluxes to ethanol, lactate, acetate, formate, and
3-hydroxypropionic acid.

2. Deletion of the global transcriptional regulators ArcA and
GlpR to improve glycerol metabolism and increase the
intracellular reducing power.

3. Deletion of sthA gene and overexpression of putAB genes to
further improve the intracellular concentration of NADPH.

4. Pathway engineering by combinatorial optimization to balance
the 1,3-PDO synthesis module and glycerol-oxidative pathway
and reduce the accumulation of the toxic intermediate metabolite
3-hydroxypropionaldehyde.

5. Optimization of fermentation process by adjusting the
dissolved oxygen.


The performance of the engineered strain is tested via fed-batch
fermentation in a 400 mL T&J minibox parallel bioreactor.

For fed-batch fermentation, the seed culture is grown for 5 h in
LBv2 medium at 37 °C and 200 rpm and then inoculated into
parallel bioreactors (10% v/v). The fermentations are conducted at
37 °C, pH 6.5 (controlled with 5 M NaOH), and an aeration rate
of 2.0 vvm. The rotation speed is automatically adjusted to maintain
the dissolved oxygen at 10% of saturation. A feeding
medium is added to keep the glycerol concentration above
10 g/L. The final engineered strain can efficiently produce about
70 g/L 1,3-PDO from glycerol with a yield of 0.61 mol/mol and a
productivity of 2.36 g/L/h in fed-batch fermentation.
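
The reported figures can be related to one another with a short
calculation. The sketch below is only an illustration: it treats the
titer as exactly 70 g/L, uses standard molar masses for 1,3-PDO
(C3H8O2) and glycerol (C3H8O3), and ignores the volume change caused
by feeding. The function name is hypothetical.

```python
PDO_MOLAR_MASS = 76.09       # g/mol, 1,3-propanediol (C3H8O2)
GLYCEROL_MOLAR_MASS = 92.09  # g/mol, glycerol (C3H8O3)


def fed_batch_summary(titer, molar_yield, productivity):
    """Derive time, mass yield, and glycerol use from the reported values.

    titer        g/L of 1,3-PDO
    molar_yield  mol 1,3-PDO per mol glycerol
    productivity g/L/h of 1,3-PDO
    """
    fermentation_time = titer / productivity            # h
    pdo_molar = titer / PDO_MOLAR_MASS                   # mol/L
    glycerol_consumed = pdo_molar / molar_yield * GLYCEROL_MOLAR_MASS  # g/L
    mass_yield = molar_yield * PDO_MOLAR_MASS / GLYCEROL_MOLAR_MASS    # g/g
    return {
        "fermentation_time_h": fermentation_time,
        "glycerol_consumed_g_per_L": glycerol_consumed,
        "mass_yield_g_per_g": mass_yield,
    }


summary = fed_batch_summary(titer=70.0, molar_yield=0.61, productivity=2.36)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions, the run lasts roughly 30 h, and about half of
the glycerol mass consumed ends up as 1,3-PDO.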


1. Since biological information is constantly updated and revised,
GEMs should be updated accordingly.

2. Wrong reaction directionality could result in abnormal metabolic
flux distributions and futile cycles, and should be revised
based on published literature and thermodynamic data.

3. Experiments or thermodynamic data could be helpful to set
appropriate boundary conditions for simulations.

4. V. natriegens grows very quickly. The experiment period should

5. The cell concentration has a great influence on the natural
transformation efficiency. ~10° CFUs in 350 µL ocean salt

6. V. natriegens has a strong tolerance to kanamycin. False-positive
clones should be discriminated during screening.

7. The time of rhamnose induction is important, and 12 h is

8. Static incubation is required during natural transformation.
Application of GeneCloudOmics: Transcriptomic Data
Analytics for Synthetic Biology


Mohamed Helmy and Kumar Selvarajoo 


Abstract 


Research in synthetic biology and metabolic engineering requires a deep understanding of the function and
regulation of complex pathway genes. This can be achieved through gene expression profiling, which
quantifies transcriptome-wide expression under any condition, such as a cell development stage,
mutant, disease, or treatment with a drug. Expression profiling is usually done using high-throughput
techniques such as RNA sequencing (RNA-Seq) or microarray. Although the two methods are based on
different technical approaches, both provide quantitative measures of the expression levels of thousands of
genes. The expression levels are compared under different conditions to identify the differentially
expressed genes (DEGs), i.e., the genes whose expression levels differ between conditions. DEGs,
often numbering in the thousands, are then investigated using bioinformatics and data analytic tools to
infer and compare their functional roles between conditions. Dealing with such large datasets therefore
requires intensive data processing and analysis to ensure data quality and produce statistically sound
results. Thus, deep statistical and bioinformatics knowledge is needed to deal with high-throughput
gene expression data. This represents a barrier for wet-lab biologists with limited computational, programming,
and data analytic skills, preventing them from realizing the full potential of the data. In this chapter, we
present a step-by-step protocol to perform transcriptome analysis using GeneCloudOmics, a cloud-based
web server that provides an end-to-end platform for high-throughput gene expression analysis.


Key words Synthetic biology, Transcriptomic data analysis, RNA-Seq, Bioinformatics, Biostatistics 


1 Introduction


The recent rapid increase in the global human population raises the
demand for food, drugs, and energy. Thus, novel and innovative
approaches are required to fill the increasing gap between supply
and demand. Metabolic engineering and synthetic biology are two
fields with promising potential for boosting the
food, drug, and fuel industries [1].

Metabolic engineering is the alteration of the metabolism of an
organism to produce new compounds (protein, enzyme, or metabolite)
or to increase the yield of an existing one [2]. On the other
hand, synthetic biology is the art of manipulating existing organisms
to give them new abilities [3] or using cell-free systems as a
bioengineering platform for manufacturing, diagnostics, and
research applications [4]. These fields have made several
contributions to industry and research, including creating plants that
tolerate environmental changes [5] or perform bioremediation [6],
developing biosensors for analyte detection [7, 8], bioproduction of
several commodities [9, 10], biofuel production, and several
applications in regenerative medicine and immunology [4, 11, 12].

The processes of manipulating and optimizing the genetics and
the growth conditions of an organism to increase the production of
certain substances or add a new ability involve multiple steps and
can follow different scenarios [13]. They range from optimizing the
growth of an existing organism all the way to transferring a whole
gene cluster of a pathway from a selected organism to a model
organism and tuning the model organism's genetics and growth
to increase the yield of a substance or add a new ability [4, 14]. The
increase in the yield must be maximized so that the production can
be economically sound. Therefore, these processes aim to increase
the yield by several thousandfold [15]. To achieve this goal, metabolic
engineering and synthetic biology utilize modern biomedical
research techniques. These include multi-omics approaches (genomics,
transcriptomics, proteomics, and metabolomics), genome-editing
techniques, systems biology, artificial intelligence, and
bioinformatics [16]. These techniques aim to define the "system" to
be manipulated and optimized, which is usually the biological
pathway(s) for producing the desired substance or ability. Among
these techniques, transcriptomics plays a crucial role in identifying
the pathways and the genes involved in them and, therefore,
contributes to the acceleration of research and applications within the
field [17].

The outstanding advancements in genome and transcriptome
sequencing, genome editing, and genetic engineering in the
last two decades, the significant decrease in the cost of molecular
laboratory techniques, and the rapid development of new
bioinformatics and computational biology tools and algorithms gave rise
to different research fields in the biomedical space [18]. Synthetic
biology is one of the major fields that benefited from these
advancements. It aims to create new biological products (parts, devices,
or systems) or redesign existing biological systems to make them
perform new functions, such as producing compounds that
they do not produce in nature [19]. The applications of synthetic
biology cover a wide range of industries including drug and vaccine
development, research reagent production, biosensing, biofuels,
and biomaterials [20, 21]. As a result, recent years have seen
a noticeable rise in synthetic biology start-ups, as well as adoption by
large companies in the pharmaceutical, biotechnology, and
chemical industries, resulting in a multibillion-dollar business based
on synthetic biology [22].

Transcriptomics is a powerful tool for studying biological systems
and elucidating gene functions [23]. In synthetic biology,
transcriptomics plays a crucial role in guiding the design processes
and the development of new devices or systems [17]. Since
transcriptomics provides a snapshot of the global gene expression
profile of the cell, analyzing it reveals the genes and pathways
involved in the investigated process. By comparing the gene expression
profiles of different conditions, treatments, or time points,
differential expression analysis (DEA) identifies key genes and
pathways that can be used to modify biological processes, to add a new
product, or to increase the yield of an existing one [24]. Synthetic
biology utilizes transcriptome analysis to understand the mechanistic
bases of gene functions and their regulation models, which allows
altering gene regulation or designing a synthetic promoter [17]. It is
also used to improve medicinal plants [25], enhance our understanding
of plant signaling pathways by combining transcriptomics and
biosensors [26], find new model organisms and chassis for microalgae
synthetic biology [27], and produce biofuel and chemicals from
bacteria [23].

The high-throughput transcriptome analysis platforms, such as
RNA-Seq and microarray, measure the level of expression of
thousands of genes in multiple conditions, at different developmental
stages, or under different treatment conditions [28]. The
analysis of this data requires processing the raw gene expression
data to get the expression levels (e.g., read counts), performing
filtering and quality control (QC) steps to remove noise and
low-quality data, preprocessing and normalizing the expression
levels, statistically analyzing the data, identifying the differentially
expressed genes (DEGs) between different conditions, and
performing a functional analysis to elucidate the pathways and
cellular functions of the DEGs [29] (Fig. 1). Such analysis involves
multiple challenges related to the data size, data quality, statistical
analysis, visualization, and interpretation of the results using
bioinformatics tools [30, 31].


2 Materials


In this chapter, we will use data from engineered Arabidopsis cells.
The data is from a study that aimed to investigate the activities of the
leucine-rich repeat receptor kinases (LRR-RKs) independently of
the endogenous receptors. LRR-RKs are a large group of receptor
kinases with over 200 members in Arabidopsis. The authors developed
a novel synthetic biology tool for investigating LRR-RK signaling
kinases in plants by developing rapamycin-inducible dimerization
(RiD) receptors that operate under the control of rapamycin (Rap),
which avoids interference with endogenous receptors [32]. The data
consists of five different conditions in which the wild-type and the
engineered cells (RiD-BRI1/BAK1) were untreated (M) or treated
with rapamycin (Rap) or brassinolide (BL), with two replicates per
condition (GEO accession: GSE136177).


Fig. 1 An overview of the gene expression profiling workflow. The transcriptomic data (RNA-Seq or microarray)
generated by the experimental instrument is processed through the quality control (QC) steps. Next, the data
that passed the QC step is normalized. Multiple statistical tests can be performed on the normalized data. The
normalized data and different methods are used to infer the differential gene expressions (DGEs).
Bioinformatics analyses on the differentially expressed genes (DEGs) provide functional inference and pathway
association


Several tools are available for analyzing transcriptomic data in the
form of R packages, Python libraries, or software tools that use
existing libraries through a GUI (reviewed by [28]). Nevertheless,
the analysis of gene expression data remains a burden because it
demands intensive statistical and programming skills that many
biologists who rely on online biological resources lack [33]. Furthermore,
most of the available tools focus on data preprocessing and DEG
identification, with less focus on the statistical analysis and even less
attention to the downstream functional interpretation [28].

With the challenges that biologists face while analyzing
transcriptomic data in mind, we developed GeneCloudOmics to
provide a one-stop server that performs the whole analysis [ref].
Online biological resources are the easiest resources to use
since they are equipped with a GUI that allows users to perform
the analysis with minimal computational skills and without local
installation or the need for programming. GeneCloudOmics was
developed as an online web server that performs end-to-end
transcriptomic data analysis, starting with preprocessing the raw read
count data, performing different statistical tests, identifying the
differentially expressed genes (DEGs), and running the downstream
bioinformatics analysis of the DEG set.


GeneCloudOmics supports RNA-Seq and microarray (.cel files)
data. Both types of data can be either uploaded to the server or
directly imported from the NCBI Gene Expression Omnibus (GEO)
database by providing the GEO accession of the transcriptomic
dataset in the designated form.


GeneCloudOmics provides four normalization techniques (RPKM,
FPKM, TPM, RUV) that are commonly used with read counts. The
normalized data can be plotted against the raw data in box plots and
violin plots, with an option to download the normalized data in CSV
format.
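
To make the difference between two of these normalizations concrete,
the sketch below computes RPKM and TPM from raw read counts using their
textbook definitions. It is an illustration only and not the
GeneCloudOmics implementation; the function names, counts, and gene
lengths are made-up placeholders, and RUV is not covered.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts)
    return [c * 1e9 / (total_reads * length)
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r * 1e6 / scale for r in rates]


# Three hypothetical genes in one sample.
counts = [10, 20, 30]
lengths_bp = [1000, 1000, 3000]

print([round(v, 2) for v in rpkm(counts, lengths_bp)])
print([round(v, 2) for v in tpm(counts, lengths_bp)])
```

TPM values always sum to one million within a sample, which makes the
proportions comparable between samples; RPKM values do not share this
property.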


For both preprocessed and processed transcriptomic data,
GeneCloudOmics allows several statistical tests to be performed. These
include read normalization for the preprocessed data and scatter
plots, correlations (linear and nonlinear), PCA (2D and 3D), and
clustering (hierarchical, k-means, t-SNE, and SOM). The results of
all tests are plotted in publication-ready quality.
3.1.4 Identifying Differentially Expressed Genes (DEG)


GeneCloudOmics provides an implementation of three of the most
commonly used DEG methods: DESeq2 [34], NOISeq [35], and
edgeR [36]. The three methods can be used through a single
interface. The user selects the method of choice to perform the
differential gene expression analysis; GeneCloudOmics then
presents the parameters of the selected method. The list
of DEGs can be downloaded in CSV format, and the results of the
differential gene expression analysis can be plotted in volcano and
dispersion plots.


3.1.5 Interprets and Analyzes Gene and Protein Lists


The list of the DEGs can be interpreted by GeneCloudOmics using
11 different bioinformatics tools in order to investigate their
functions, pathways, and disease relevance and study their
physicochemical and evolutionary properties. The bioinformatics
interpretation features of GeneCloudOmics can also be used
independently of the transcriptomic data analysis workflow to interpret
any given list of genes or proteins. The bioinformatics tools of
GeneCloudOmics set it apart from the available tools, since most
differential gene expression analysis tools either do not include
bioinformatics features for gene set analysis or include only a few
basic analyses such as GO and pathway enrichment [28]. Moreover,
GeneCloudOmics provides all common protein and gene bioinformatics
tools, including GO enrichment analysis, pathway enrichment analysis,
complex enrichment, protein-protein interaction (PPI), protein function,
protein subcellular localization, protein domains, tissue expression,
gene co-expression, protein physicochemical properties, protein
evolutionary analysis, and protein pathological analysis.


3.1.6 Creates a Customized Analysis Report


GeneCloudOmics provides the user with the option of creating an
analysis report that gathers and summarizes the results and plots
that the user finds interesting. The user can click the "Add to
Report" option on the left-hand side in all GeneCloudOmics
tests. This will add the plot and the analysis title to the analysis
report. At the end of the session, the user can go to the "Analysis
report" page in the main menu. GeneCloudOmics will generate a
report that contains the added plots. The report is generated in
HTML format and can be downloaded as a single PDF file.


3.2 Statistical Tests in Transcriptomic Data Analysis


Several biostatistical tests and data analytics are used
in the analysis of transcriptomic data. GeneCloudOmics provides
ten of the most used tests and analytics for analyzing the data (such
as PCA and Pearson correlation) and assessing the quality of the
data (such as noise and entropy analysis). Here, we briefly overview
each of them.


3.2.1 Scatter Plot


The scatter plot compares the level of expression of the genes in any
two conditions or two replicates. It displays the respective
expression of all genes in a 2D space. Before creating a scatter plot,
it is recommended to normalize the data for sequencing depth (see
the preprocessing stage below). Since gene expression
data is naturally skewed toward very high expression-level regions,
it is recommended to apply a log transformation to the data to
capture the whole data range. In GeneCloudOmics, users can
choose between natural log, log base 2, and log base 10 and can add
an optional linear regression line to the plot. Gene expression
data are densely distributed in the lowly expressed region, which
usually makes the dots indistinguishable in a regular scatter plot.
GeneCloudOmics therefore overlays a 2D kernel density estimation on
the scatter plot to visualize the density of expression levels.
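
The effect of the log transformation and of the optional regression
line can be sketched in a few lines of Python. This is an illustrative
example, not GeneCloudOmics code: the pseudocount of 1 added before
taking the logarithm is an assumption made here so that zero counts
remain defined, and the expression values are invented.

```python
import math

LOG_FUNCTIONS = {"ln": math.log, "log2": math.log2, "log10": math.log10}


def log_transform(values, kind="log2", pseudocount=1.0):
    """Apply log(value + pseudocount) with the chosen base."""
    log = LOG_FUNCTIONS[kind]
    return [log(v + pseudocount) for v in values]


def fit_line(x, y):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical expression values of five genes in two samples.
sample_a = [0, 1, 3, 7, 1023]
sample_b = [0, 3, 7, 15, 2047]

log_a = log_transform(sample_a)
log_b = log_transform(sample_b)
slope, intercept = fit_line(log_a, log_b)
print(log_a, log_b)
print(round(slope, 3), round(intercept, 3))
```

On the raw scale the single highly expressed gene dominates the plot,
whereas on the log scale all five genes are spread along the axis.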


GeneCloudOmics provides several distribution fitting options that
compare the full set of gene expression values with different
continuous statistical distributions, which can be used to test the
data and choose a nonarbitrary, statistically based lower expression
cutoff. To visualize the comparison, GeneCloudOmics displays the
cumulative distribution function of the preprocessed gene expression
data together with the user-selected theoretical distributions.
Confirming that the gene set follows a particular distribution
supports the validity of the gene expression data. GeneCloudOmics
also provides a table that shows the best-fitted distribution in each
sample.


Pearson correlation measures the linear relationship between two
vectors. The Pearson correlation coefficient $r = 1$ if the two vectors
are identical and $r = 0$ if there is no linear relationship between
the vectors. The coefficient $r$ between two vectors (e.g., the
transcriptomes of two different samples), each containing $n$
observations (e.g., gene expression values), is defined by (for large $n$):


$$r(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \mu_X)(y_i - \mu_Y)}{n \, \sigma_X \sigma_Y}$$


where $x_i$ and $y_i$ are the $i$th observations in the vectors $X$ and $Y$,
respectively, $\mu_X$ and $\mu_Y$ are the mean values of each vector, and
$\sigma_X$ and $\sigma_Y$ are the corresponding standard deviations.
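
The definition of the Pearson coefficient can be written out directly.
The following is a minimal sketch in plain Python, assuming population
standard deviations (division by n), in line with the "large n" remark;
it is not the GeneCloudOmics implementation, and the vectors are toy
data.

```python
import math


def pearson(x, y):
    """Pearson correlation using population means and standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return covariance / (sd_x * sd_y)


print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear, increasing
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # perfectly linear, decreasing
print(pearson([1, 2, 3], [1, 3, 2]))        # partial agreement
```

A negative coefficient indicates that one vector decreases as the other
increases.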


Spearman rank correlation is a nonparametric test that is used to 
measure the degree of association between two vectors (e.g., the 
transcriptomes of two different samples). The Spearman rank correlation 
test does not carry any assumptions about the distribution of the 
data and is the appropriate correlation analysis when the variables 
are measured on a scale that is at least ordinal. The following 
formula is used to calculate the Spearman rank correlation: 

$$ \rho(X,Y) = 1 - \frac{6 \sum_{i=1}^{n} (r_{x_i} - r_{y_i})^2}{n(n^2 - 1)} $$

where $r_{x_i}$ and $r_{y_i}$ are the ranks of the ith gene's values $x_i$ and $y_i$ in 
vectors X and Y, respectively, and n is the number of genes in each vector. 
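
The rank-difference formula can be sketched as below. The sketch assumes there are no tied values, which is the condition under which this closed form is exact; the function names are illustrative.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation from squared rank differences."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))

# Monotonic but nonlinear relationship: the ranks agree perfectly
rho_monotonic = spearman([10, 20, 30, 40], [1, 4, 9, 16])
# Two neighboring genes swap order between the samples
rho_swapped = spearman([5, 1, 9, 3], [2, 0.5, 30, 8])
```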
3.2.4 PCA 


3.2.5 Heatmap and Gene 
Clustering 


Principal component analysis (PCA) is a multivariate statistical 
technique for simplifying high-dimensional datasets [37]. Given 
m observations on n variables, the goal of PCA is to reduce the 
dimensionality of the data matrix by finding r new variables, where 
r is less than n. Termed principal components, these r new variables 
together account for as much of the variance in the original 
n variables as possible while remaining mutually uncorrelated and 
orthogonal. Each principal component is a linear combination of 
the original variables, and so it is often possible to ascribe meaning 
to what the components represent. A PCA analysis of transcrip- 
tomic data considers the genes as variables, creating a set of “prin- 
cipal gene components” that indicate the features of genes that best 
explain the experimental responses they produce. To compute the 
principal components, the eigenvalues and their corresponding 
eigenvectors are calculated from the n × n covariance matrix of 
conditions. Each eigenvector defines a principal component. A 
component can be viewed as a weighted sum of the conditions, 
where the coefficients of the eigenvectors are the weights. The 
projection of gene i along the axis defined by the jth principal 
component is: 

$$ a'_{ij} = \sum_{t=1}^{n} a_{it} v_{tj} $$

where $v_{tj}$ is the tth coefficient for the jth principal component, $a_{it}$ is 
the expression measurement for gene i under the tth condition, and 
A' is the data in terms of principal components. Since V is an 
orthonormal matrix, A' is a rotation of the data from the original 
space of observations to a new space with principal component 
axes. The variance accounted for by each of the components is its 
associated eigenvalue; it is the variance of a component over all 
genes. Consequently, the eigenvectors with large eigenvalues are 
the ones that contain most of the information; eigenvectors with 
small eigenvalues are uninformative. 
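
The computation can be illustrated for the first principal component only. This sketch assumes mean-centered conditions and uses power iteration to find the leading eigenvector of the covariance matrix; a full PCA would normally use a linear algebra library to obtain all components. The function names and toy data are hypothetical.

```python
import math

def covariance_of_conditions(data):
    """Center each condition (column) and return (centered data, n x n covariance)."""
    m, n = len(data), len(data[0])
    means = [sum(row[t] for row in data) / m for t in range(n)]
    centered = [[row[t] - means[t] for t in range(n)] for row in data]
    cov = [[sum(r[a] * r[b] for r in centered) / (m - 1) for b in range(n)]
           for a in range(n)]
    return centered, cov

def leading_component(cov, iterations=200):
    """Power iteration: (eigenvalue, unit eigenvector) for the largest eigenvalue."""
    n = len(cov)
    v = [float(t + 1) for t in range(n)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(n)) for a in range(n))
    return eigenvalue, v

# Rows are genes, columns are conditions (toy values)
expression = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
centered, cov = covariance_of_conditions(expression)
eigenvalue, v1 = leading_component(cov)
# Projection of each gene on the first component: sum over conditions of a_it * v_t1
scores = [sum(a_it * v_t for a_it, v_t in zip(row, v1)) for row in centered]
```

In this toy dataset the two conditions are perfectly correlated, so the first component carries all of the variance, and the variance of the projected scores equals the eigenvalue.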


Hierarchical clustering is used to find the groups of co-expressed 
genes [38]. The clustering is performed on normalized expressions 
of differentially expressed genes using the Ward clustering method. 
The normalized expression of the jth gene at time $t_i$ is defined as: 

$$ z_j(t_i) = \frac{x_j(t_i) - \bar{x}_j}{\sigma_j} $$

where $x_j(t_i)$ is the expression of the jth gene at time $t_i$, $\bar{x}_j$ is its mean 
expression across all time points, and $\sigma_j$ is its standard deviation. 
GeneCloudOmics applies hierarchical clustering to the output 
of the DE analysis using EdgeR [ref] in the previous section. Alternatively, 
the user can carry out clustering independently, without 
going through DE analysis, by specifying the minimum fold change 
of gene expression between two samples. GeneCloudOmics also 
lists the names of the genes in each cluster in the Gene Clusters tab. 
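
The per-gene normalization applied before clustering can be sketched as follows. The use of the population standard deviation is an assumption (the sample standard deviation is an equally plausible choice), and the function name and values are illustrative.

```python
import statistics

def normalize_profile(profile):
    """Subtract the mean across time points and divide by the standard deviation."""
    mean = statistics.mean(profile)
    sd = statistics.pstdev(profile)  # population SD (assumption)
    return [(value - mean) / sd for value in profile]

gene_profile = [2.0, 4.0, 6.0]       # expression of one gene at three time points
z = normalize_profile(gene_profile)  # centered on 0 with unit standard deviation
```

After this step every gene profile has the same scale, so the clustering groups genes by the shape of their expression profiles rather than by their absolute expression levels.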
To quantify the scatter in gene expression across all replicates 
of one experimental condition, GeneCloudOmics computes the 
transcriptome-wide average noise for each cell type, defined as: 


$$ \eta_{tot}^2 = \frac{1}{n} \sum_{i=1}^{n} \eta_i^2 $$

where n is the number of genes and $\eta_i^2$ is the pairwise noise of the 
ith gene (variability between any two replicates), defined as: 

$$ \eta_i^2 = \frac{2}{m(m-1)} \sum_{j=1}^{m-1} \sum_{k=j+1}^{m} \eta_{ijk}^2 $$

where m is the number of replicates in each condition and $\eta_{ijk}^2$ is the 
expression noise of the ith gene, defined as the variance divided by 
the squared mean expression in the pair of replicates (j, k). 
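
A minimal sketch of the noise calculation, assuming the population variance for each replicate pair and strictly positive expression values (a zero mean would make the ratio undefined). The function names and numbers are hypothetical.

```python
from itertools import combinations
import statistics

def pairwise_noise(a, b):
    """Noise of one gene in one replicate pair: variance over squared mean."""
    mean = (a + b) / 2.0
    return statistics.pvariance([a, b]) / mean ** 2

def gene_noise(replicate_values):
    """Average of the pairwise noise over all m(m-1)/2 replicate pairs."""
    pairs = list(combinations(replicate_values, 2))
    return sum(pairwise_noise(a, b) for a, b in pairs) / len(pairs)

def total_noise(genes):
    """Transcriptome-wide average noise: the mean of the per-gene noise values."""
    return sum(gene_noise(values) for values in genes) / len(genes)

# Two genes measured in three replicates of one condition (toy values)
genes = [[10.0, 10.0, 10.0], [5.0, 15.0, 10.0]]
noise = total_noise(genes)
```

Averaging over the m(m-1)/2 pairs is the same as multiplying the double sum by 2/(m(m-1)).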


Shannon entropy [39] measures the disorder of a high-dimensional 
system, where higher values indicate increasing disorder. The 
entropy of each transcriptome, X, is defined as: 

$$ H(X) = -\sum_{i} p(x_i) \log p(x_i) $$

where $p(x_i)$ is the probability of gene expression value $x = x_i$. The 
entropy values are obtained through a histogram-based partitioning 
approach, and the number of bins is determined using 
Doane's rule: $b(X) = 1 + \log_2 n + \log_2(1 + |g_X|/\sigma_g)$, where n is 
the number of expression values in the sample, $g_X$ is 
the skewness of the expression distribution of each sample, and 

$$ \sigma_g = \sqrt{\frac{6(n-2)}{(n+1)(n+3)}} $$
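
The histogram-based entropy can be sketched as follows. The sketch assumes a base-2 logarithm (entropy in bits), equal-width bins, and rounding of Doane's bin count to the nearest integer; none of these choices is specified in the text, and the function names are illustrative.

```python
import math

def doane_bins(values):
    """Number of histogram bins from Doane's rule, rounded to an integer."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    skewness = m3 / m2 ** 1.5
    sigma_g = math.sqrt(6.0 * (n - 2) / ((n + 1) * (n + 3)))
    return int(round(1 + math.log2(n) + math.log2(1 + abs(skewness) / sigma_g)))

def shannon_entropy(values, bins):
    """Entropy (in bits) of the equal-width histogram of `values`."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        index = min(int((v - lo) / width), bins - 1)  # the maximum falls in the last bin
        counts[index] += 1
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

expression = [0.0, 1.0, 2.0, 3.0]          # toy, evenly spread values
bins = doane_bins(expression)
entropy = shannon_entropy(expression, bins)
```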


Random forest clustering is an unsupervised learning approach in 
which samples are grouped into clusters based on their similarity 
(usually based on Euclidean distance) [40]. The random forest 
algorithm is used to generate a proximity matrix, a rough estimate 
of the similarity between samples based on the proportion of times 
the samples end up in the same leaf node of a decision tree. The 
proximity matrix is converted to a distance matrix, which is then 
passed to the hierarchical clustering algorithm. The implementation 
of random forest clustering in GeneCloudOmics is based on [41]. 


A self-organizing map (SOM) produces a two-dimensional, discre- 
tized representation of the high-dimensional gene expression 
matrix and is, therefore, a dimensionality reduction technique. 
Self-organizing maps use a neighborhood function to preserve 
the topological properties of the input gene expression matrix [42]. 

Each data point (one sample) in the input gene expression 
matrix recognizes itself by competing for representation. SOM 
mapping steps start from initializing the weight vectors. From 
there, a sample vector is selected randomly, and the map of weight 
vectors is searched to find which weight best represents that sample. 
Each weight vector has neighboring weights that are close to it. The 
weight that is chosen is rewarded by being able to become more like 
that randomly selected sample vector. The neighbors of that weight 
are also rewarded by being able to become more like the chosen 
sample vector. This allows the map to grow and form different 
shapes. Most generally, they form square, rectangular, hexagonal, 
or L shapes in 2D feature space. 


3.2.10 t-Distributed 


3.3.1 Gene Ontology (GO) 
Annotation 


t-SNE is a dimensionality reduction approach that reduces the 
complexity of highly complex data such as transcriptomic data. It 
visualizes the sample interrelations in a two- or three-dimensional 
visualization. This allows the identification of the close similarities 
between samples through the relative location of mapped points. 
Since t-SNE is nonlinear and able to control the trade-off between 
local and global relationships among points, its visualization of the 
clusters is usually more compelling when compared with other 
methods [43]. GeneCloudOmics provides an intuitive interface 
for performing t-SNE analysis on the processed, untransformed 
transcriptomic data by entering three inputs: (1) the perplexity 
value, (2) the number of principal components (PCs), and 
(3) the number of clusters. The user can also choose to log transform 
the data before submission. 


DGE analysis usually outputs a list of genes that are statistically 
determined as differentially expressed. Then, the list of DEGs is 
analyzed, interpreted, and annotated to learn more about the func- 
tions, pathways, and cellular processes that these genes are involved 
in. GeneCloudOmics provides 12 bioinformatics analyses that can 
be performed on a given gene/protein dataset. 


GeneCloudOmics performs GO annotation for a given set of pro- 
teins by reading the GO terms associated with them directly from 
UniProt Knowledgebase [44], then visualizes each of the three GO 
domains (cellular component, molecular function, and biological 
process) in an independent tab in a bar chart, as well as provides the 
annotation results in a downloadable tabular format. 


The pathway enrichment analysis of the DEGs produces a list of 
biological pathways in which those genes are involved more often than 
would be expected by chance. This provides researchers with 
mechanistic insights into how those genes affect cellular functions 
[45]. For a given gene or protein set, GeneCloudOmics uses 
g:Profiler [46] to perform a pathway enrichment analysis and displays 
the results as a network where the nodes are the pathways and the 
edges are the overlaps between the pathways. GeneCloudOmics 
uses Cytoscape.JS for the network visualization [47]. The enrichment 
results can also be downloaded as a CSV file. 
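
The idea of involvement beyond chance is commonly formalized as an over-representation test. The sketch below shows the upper tail of the hypergeometric distribution as one such formulation; it is an illustration under that assumption and is not presented as the exact computation performed by g:Profiler. The function name and numbers are hypothetical.

```python
from math import comb

def enrichment_p_value(background, pathway_size, query_size, overlap):
    """P(X >= overlap) when `query_size` genes are drawn from `background` genes,
    `pathway_size` of which belong to the pathway (hypergeometric upper tail)."""
    upper = min(pathway_size, query_size)
    total = comb(background, query_size)
    tail = sum(comb(pathway_size, k) * comb(background - pathway_size, query_size - k)
               for k in range(overlap, upper + 1))
    return tail / total

# Toy numbers: 10 annotated genes, a pathway of 5, and a query of 5 genes
# that all fall inside the pathway
p_all_in_pathway = enrichment_p_value(10, 5, 5, 5)
```

A small p-value indicates that the observed overlap between the query and the pathway is unlikely for a randomly drawn gene list of the same size. In practice the p-values are also corrected for the number of pathways tested.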
Investigating protein-protein interactions (PPIs) is one of the 
essential steps in systems biology studies. GeneCloudOmics provides 
users with an interface where they can upload a set of proteins 
(UniProt accessions) and retrieve all the interactions associated 
with them. The interactions are visualized as a network where the 
nodes are the proteins, the edges are the interactions, and the node 
size corresponds to the number of interactors of the protein. We use 
Cytoscape.JS for PPI visualization [47]. The results are also displayed 
as an interaction table and can be downloaded as a network or an 
interaction table. 
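
The node-size rule can be sketched by counting the distinct interactors of each protein in an interaction table. The identifiers below are placeholders rather than real UniProt accessions, and the function name is illustrative.

```python
from collections import defaultdict

def interactor_counts(interactions):
    """Number of distinct interaction partners for every protein in an edge list."""
    partners = defaultdict(set)
    for a, b in interactions:
        partners[a].add(b)
        partners[b].add(a)
    return {protein: len(others) for protein, others in partners.items()}

# Placeholder identifiers; the repeated pair is counted only once
interactions = [("PROT_A", "PROT_B"), ("PROT_A", "PROT_C"),
                ("PROT_B", "PROT_A"), ("PROT_C", "PROT_D")]
node_sizes = interactor_counts(interactions)
```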


GeneCloudOmics provides the user with a complex enrichment 
feature that allows the identification of proteins in the provided 
dataset that are part of a known protein complex. This feature uses 
the CORUM database, which contains curated complex information 
for mammalian proteins [48]. It reports the complex-forming proteins 
in the submitted dataset together with information on their complexes. 


GeneCloudOmics retrieves the protein function information from 
UniProt of a given protein set (UniProt accessions). The retrieved 
protein functions are displayed in a downloadable tabular format. 


The protein subcellular localization feature of GeneCloudOmics 
provides the user with an interface to get the subcellular localiza- 
tion information for a given list of proteins (UniProt accessions) 
and display the results in a downloadable tabular format. 


GeneCloudOmics provides the users with a protein domain feature 
that connects to UniProt Knowledgebase and retrieves the domain 
information associated with each protein in a given list of UniProt 
accessions. 


The tissue expression feature in GeneCloudOmics provides the user 
with the tissue expression for each protein in a given protein list 
(UniProt accessions) through retrieving this information from 
UniProt Knowledgebase. The result is displayed in a downloadable 
tabular format. 


The co-expression analysis is a common analysis that assesses the 
expression level of different genes to identify simultaneously 
expressed genes. The resultant co-expression networks are used to 
identify functionally related genes or genes being controlled by the 
same transcriptional mechanism [49]. GeneCloudOmics provides 
the user with an interface where they can submit a co-expression 
query to GeneMANIA [50] and then shows the results on 
GeneMANIA’s website in a new tab. Currently, GeneCloudOmics 
supports queries for nine model organisms: human, yeast, 
E. coli, C. elegans, Arabidopsis, Drosophila, zebrafish, mouse, 
and rat. 
3.4.1 The Required Data 


3.4.2 Importing or 
Uploading Data to 
GeneCloudOmics 


For a given set of proteins (UniProt accessions), GeneCloudOmics 
provides the user with their complete sequences in a single 
FASTA file and allows the user to investigate their physicochemical 
properties. The physicochemical analysis includes sequence charge, 
GRAVY index [51], and hydrophobicity. 
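
As an illustration of one of these properties, the GRAVY (grand average of hydropathy) index is the mean hydropathy of the residues in a sequence. The sketch below assumes the widely used Kyte-Doolittle scale and ignores characters that are not one of the 20 standard residues; the function name and test sequences are illustrative.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(sequence):
    """Mean hydropathy of a protein sequence; positive values are more hydrophobic."""
    residues = [aa for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)

hydrophobic = gravy("AAAA")  # alanine only
balanced = gravy("IR")       # isoleucine and arginine cancel out
```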


For a given set of proteins (UniProt accessions), GeneCloudOmics 
provides the user with a phylogenetic and evolutionary analysis that 
includes multiple sequence alignment (MSA) of the protein 
sequences, clustering based on the amino acid sequences, chromo- 
somal location, or gene tree. 


Several diseases are associated with the malfunction of certain genes 
or proteins. The disease-protein association is collected in different 
online resources such as OMIM database [52], DisProt [53], and 
DisGeNET [54]. GeneCloudOmics provides the users with an 
interface that retrieves the disease-protein association from online 
databases for a given list of proteins (UniProt accessions). The 
disease-protein association is visualized as bubble charts that show 
the distribution of the proteins among the disease or the distribu- 
tion of diseases among the proteins. 


In synthetic biology research, transcriptome analysis is one of the 
main approaches that helps provide a deeper understanding of the 
investigated system and in developing new tools. In this section, we 
will demonstrate how GeneCloudOmics can be employed in tran- 
scriptomic data analysis for synthetic biology applications. 


GeneCloudOmics supports RNA-Seq and microarray data 
[28]. RNA-Seq data can be provided as raw read counts, which 
go through multiple steps of preprocessing and normalization, 
or as normalized read counts, which are analyzed directly. 
Microarray data are supported as CEL files. Multiple CEL files can 
be uploaded to GeneCloudOmics as one compressed file. The data 
can also be imported directly from the NCBI Gene Expression Omnibus 
(GEO) database using GEO accession numbers. Figure 2 
shows the different data import and upload methods in 
GeneCloudOmics. 


We downloaded the RNA-Seq data of the engineered Arabidopsis 
cells from GEO, decompressed it, and created a metadata file 
(Table 1). Since the data is raw, we used the raw file (read count) 
upload option (Fig. 3a). This option requires two files: (1) the raw 
read count file and (2) the metadata file (Table 1). Optionally, two 
more files can be uploaded: (1) the gene length file, which is 
required for the RPKM, FPKM, and TPM normalization, and 
(2) the negative controls (e.g., ERCC Spike-In) file, which is 
required for the RUV normalization. Several data upload and 
import alternatives are available, as mentioned above (Fig. 3b-e). 
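
The role of the gene length file can be seen in a short sketch of the TPM calculation for one sample, where each count is first divided by the gene length in kilobases and the resulting rates are scaled to sum to one million. The function name and toy values are illustrative.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million for one sample, given raw counts and gene lengths in bp."""
    rates = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]  # reads per kilobase
    scale = sum(rates) / 1_000_000.0
    return [r / scale for r in rates]

# Three genes whose counts are proportional to their lengths (toy values)
values = tpm([10, 20, 30], [1000, 2000, 3000])
```

Because the counts here scale with gene length, all three genes receive the same TPM value, which shows why length information is needed to compare expression between genes.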
Fig. 2 An overview of GeneCloudOmics features. (a) RNA-Seq and microarray data uploading and importing. 
(b) Data preprocessing and normalization using four different methods (upper quartile, FPKM, RPKM, and TPM). 
(c) Transcriptomic data analysis that includes DGE analysis and multiple biostatistical tests (distribution 
fitting, scatter plots, correlations, PCA), noise analysis, and clustering methods (hierarchical, k-means, t-SNE, 
Table 1 
The contents of the metadata file 


Sample    Time 
CM1       t1 
CM2       t1 
CR1       t2 
CR2       t2 
RM1       t3 
RM2       t3 
RR1       t4 
RR2       t4 
RB1       t5 
RB2       t5 

Once the data upload is complete, we can start the preprocessing 
and normalization to normalize the raw data and plot the 
raw data against the normalized data. First, go to the “Preprocessing” 
tab, and enter the minimum value and the minimum 
number of columns; here, we are using the default values of 
1 and 2, respectively. Next, choose the normalization method and 
click submit. Each normalization method requires one of the 
optional files, as explained above (Fig. 4a). GeneCloudOmics 
plots the normalized data against the raw data as box and violin 
plots (Fig. 4b and c). To create the plots, go to the RLE plot tab 
and the violin plot tab, respectively. 
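
The exact meaning of the two preprocessing thresholds is not spelled out here. One plausible reading, assumed in the sketch below, is that a gene is kept only if its count reaches the minimum value in at least the minimum number of columns (samples). The function name, gene names, and counts are hypothetical.

```python
def filter_genes(table, min_value=1, min_columns=2):
    """Keep genes whose count is >= min_value in at least min_columns samples
    (an assumed interpretation of the two preprocessing parameters)."""
    return {gene: values for gene, values in table.items()
            if sum(1 for v in values if v >= min_value) >= min_columns}

# Toy raw counts: gene name -> counts in four samples
raw_counts = {
    "geneA": [0, 0, 0, 3],    # expressed in one sample only: removed
    "geneB": [5, 0, 2, 0],    # expressed in two samples: kept
    "geneC": [12, 9, 15, 11], # expressed everywhere: kept
}
kept = filter_genes(raw_counts)
```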


GeneCloudOmics allows performing several statistical analyses on 
the normalized read count data. In this stage, we will demonstrate 
each of them. 


The scatter plot compares the gene expression profiles of two different 
conditions or of two replicates. It is generated from the normalized 
data. To create a scatter plot in GeneCloudOmics, perform 
data normalization as described above, and then go to the “Scattered” 
link in the “Transcriptome Analysis” menu. Choose the two 
conditions or replicates that you want to compare as the X-axis and 
Y-axis, choose the log transformation method, and then click the 
plot button (Fig. 5a). The scatter plot is displayed with the 
R-value shown above it (Fig. 5b). 


Fig. 2 (continued) SOM). (d) Bioinformatics analysis of gene or protein lists that includes gene ontology (GO) enrichment, 
pathway enrichment, PPI, complex enrichment, gene-/protein-disease association, protein properties, 
evolutionary analysis, and protein pathological analysis 
Fig. 3 Data upload and import to GeneCloudOmics. (a) Uploading raw RNA-Seq data, (b) uploading normalized 
RNA-Seq data, (c) uploading microarray data (.CEL files), (d) importing data from GEO databases using GEO 
accession, and (e) selecting a dataset from the GEO-imported data 

Fig. 4 Raw data normalization and plots. (a) Normalization parameters and methods, (b) box plot of raw data 
and normalized data, and (c) violin plot of raw data and normalized data 

Fig. 5 The scatter plot of gene expression in two conditions. (a) The scatter plot parameters, (b) an example of 
a scatter plot between the gene expression in the wild-type (CM1) and the engineered cells (RR1) 


Distribution Fitting 


GeneCloudOmics provides fitting to six different continuous 
statistical distributions for comparison with the gene expression 
distribution. To perform a distribution fitting, perform data 
normalization as described above, and then go to the “Distribution” 
link in the “Transcriptome Analysis” menu. Choose the condition 
or replicate that you want to investigate, and then choose the 
statistical distribution(s) from the list of available distributions. You 
can zoom to a range of expression values to investigate its distribution 
(Fig. 6a). After providing all the parameters, click “Plot.” Repeat 
the steps for all conditions or replicates of interest (Fig. 6b, c). 

Fig. 6 The statistical distribution fitting parameters and plots. (a) The six different statistical continuous 
distributions and the zooming option, (b and c) two distribution fitting results of the gene expression of the 
wild-type and the engineered cells 


Principal Component 
Analysis (PCA) 


GeneCloudOmics enables performing PCA and plots the PCA 
variance, PCA-2D and PCA-3D. To perform PCA on the normal- 
ized data, go to the “PCA” link on the “Transcriptome Analysis” 
menu. Enter the gene sample size, choose the gene sample order 
from the list of provided options, and click “Plot” (Fig. 7a). In this 
example, we are using the default values. The PCA variance tab 
shows a bar plot of the top ten principal components (PCs) 
(Fig. 7b). 

To plot the PCA-2D and PCA-3D, go to the corresponding 
tabs and enter the parameters. You can choose which PCs to place on 
the X-axis and the Y-axis, the gene sample size, the gene sample 
order, and the number of clusters, and whether to display the sample 
names (Fig. 8a). In the PCA-3D plot, you need to choose a third PC 
(Fig. 8b, c). 


Fig. 7 The PCA parameters and PCA variance plot. (a) The PCA parameters, (b) the PCA variance plot 


Correlation 


GeneCloudOmics supports two correlation tests: the Pearson 
correlation, which measures the linear relationship between the 
gene expression vectors of different conditions or replicates, and 
the Spearman rank correlation, which measures the degree of 
association between them. 

Fig. 8 The PCA-2D and PCA-3D parameters and plots. (a) The PCA-2D and PCA-3D parameters, (b) the 
PCA-2D plot, and (c) the PCA-3D plot 

Fig. 9 The Pearson correlation analysis. (a) The method selection section, (b) the correlation matrix, (c) the 
correlation heatmap plot, and (d) the correlation plot 

To perform a Pearson correlation analysis, normalize the data as 
described above, and then go to the “Correlation” link in the 
“Transcriptome Analysis” menu. Then, in the “Method” section, 
choose “Pearson correlation” and then click “Plot” (Fig. 9a). The 
correlation is plotted as a heatmap or a correlation plot or displayed 
as a correlation matrix (Fig. 9b-d). Each of the outputs can be 
accessed through the corresponding tab. 

Fig. 10 The Spearman correlation analysis. (a) The method selection section, (b) the correlation matrix, (c) the 
correlation heatmap plot, and (d) the correlation plot 

To perform the Spearman correlation analysis, normalize the data 
as described above, and then go to the “Correlation” link in the 
“Transcriptome Analysis” menu. Then, in the “Method” section, 
choose “Spearman correlation” and then click “Plot” (Fig. 10a). 
The rest of the analysis and the plot are similar to the Pearson 
correlation (Fig. 10b-d). 


GeneCloudOmics computes the transcriptome-wide average noise for 
each replicate/condition to quantify the scatter in gene expression 
across all replicates of one experimental condition. To perform 
the noise analysis, normalize the data as described above, and then 
go to the “Noise” link in the “Transcriptome Analysis” menu. 
Then, select the “Anchor genotype,” which will be used for the 
comparison, select the desired plot options, and then click “Plot” 
(Fig. 11a). The noise can be plotted as a bar chart (Fig. 11b) or a 
line chart (Fig. 11c). 


Transcriptomics Data Analytics for Synthetic Biology 243 


Shannon Entropy 


3.4.5 Differential Gene 
Expression Analysis 


DE Analysis 


Heatmap 


To measure the disorder of the transcriptomic data as a high- 
dimensional system, GeneCloudOmics computes Shannon entropy 
for each sample (condition or replicate). The higher Shannon 
entropy values indicate an increasing disorder. To perform the 
Shannon entropy analysis, normalize the data as described above, 
and then go to the “Entropy” link in the “Transcriptome Analysis” 
menu. Then, select if your data is time-series data or not and click 
“Plot” (Fig. 12a). The Shannon entropy can be plotted as a bar 
chart (Fig. 12b) or a line chart (Fig. 12c). 


GeneCloudOmics provides an interface for three of the most used 
DE analysis methods (EdgeR, DESeq2, and NOISeq). In this 
tutorial, we will demonstrate how to perform DE analysis using 
EdgeR since it supports the generation of all plots supported by 
GeneCloudOmics: the volcano plot and the dispersion plot 
(Fig. 13). The volcano plot shows the statistical significance 
(p-value) of each gene in relation to the fold change in its expression, 
while the dispersion plot quantifies how far the variance deviates 
from the mean. 

To perform a DE analysis using GeneCloudOmics, normalize 
the data as described above, and then go to the “DE Analysis” link 
in the “Transcriptome Analysis” menu. Choose the DE analysis 
method from the provided list (here, we will choose “EdgeR”), 
and select the number of replicates in your data (single or multiple) 
(Fig. 13a). DE analysis is performed as a comparison between two 
conditions; hence, you need to choose the two conditions to be 
compared. Here, we are using the wild-type and the engineered 
cells. Finally, you need to provide the DE criteria, the false discov- 
ery rate (FDR), and the minimum fold change, and then click 
“Plot” (Fig. 13a). 
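
The two DE criteria can be sketched as a filter on adjusted p-values and fold changes. The sketch assumes the Benjamini-Hochberg adjustment for the FDR and a fold change on a linear scale; it does not reproduce the statistical model of EdgeR, and the function names, gene names, and values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (FDR) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def select_degs(genes, fdr_cutoff=0.05, min_fold_change=2.0):
    """genes: list of (name, fold_change, p_value); keeps up- and down-regulated genes."""
    fdr = benjamini_hochberg([p for _, _, p in genes])
    return [name for (name, fc, _), q in zip(genes, fdr)
            if q <= fdr_cutoff and (fc >= min_fold_change or fc <= 1.0 / min_fold_change)]

# Toy results for four genes: (name, fold change, raw p-value)
results = [("g1", 4.0, 0.01), ("g2", 1.1, 0.04), ("g3", 0.2, 0.03), ("g4", 3.0, 0.005)]
degs = select_degs(results)
```

Here g2 passes the FDR cutoff but is excluded because its fold change is below the minimum, which illustrates why both criteria are applied together.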

Once the execution is done, the list of DEGs, their statistical 
significance, and their fold changes will be available for download as a 
CSV file. The volcano plot and the dispersion plot can be generated 
by clicking the corresponding tabs (Fig. 13b, c). Both plots can be 
downloaded in PDF format. 


Heatmaps represent the variance in gene expression using color 
intensity and help visualize clusters of genes with similar expression 
profiles. The map is a grid where each row is a gene and each 
column is a sample (condition or replicate). To perform the heatmap 
analysis in GeneCloudOmics, normalize the data as described 
above, and then go to the “Heatmap” link in the “Transcriptome 
Analysis” menu. Then select whether you want to create a heatmap plot 
for all genes or for the DEGs only, choose the number of clusters, and 
then click “Plot” (Fig. 14a). The heatmap plot will be displayed in the 
“Heatmap” tab (Fig. 14b). The genes of each cluster can be downloaded 
as a CSV file from the “Gene Clusters” tab. 


244 Mohamed Helmy and Kumar Selvarajoo 


[Fig. 11a panel: noise plot options (between replicates, genotypes averaged over replicates, or genotypes without replicates), anchor genotype (CM1), and graph type (bar chart or line chart)] 

Fig. 11 The noise analysis parameters and plots. (a) The noise analysis parameters, (b) the noise plotted as a 
bar chart, (c) the noise plotted as a line chart 

Fig. 12 The Shannon entropy analysis parameters and plots. (a) The Shannon entropy analysis parameters, (b) 
the entropy plotted as a bar chart, (c) the entropy plotted as a line chart 

Fig. 13 The DE analysis methods, parameters, and plot. (a) The DE analysis methods and parameters, (b) the 
volcano plot, and (c) the dispersion plot. Both plots are only available for EdgeR and DESeq2 DE analysis 
methods 

Fig. 14 The heatmap parameters and plot. (a) The heatmap parameters and (b) the heatmap plot 


Self-Organizing Map (SOM) 


The SOM analysis is a dimensionality reduction approach to reduce 
the complexity of the gene expression data. To perform the SOM 
analysis in GeneCloudOmics, normalize the data as described 
above, and then go to the “SOM” link in the “Transcriptome 
Analysis” menu. Choose whether you want to perform the analysis using 
one sample or all samples. Then enter the number of horizontal and 
vertical grids, the number of clusters, and the log transformation, 
and then click “Plot” (Fig. 15a). The SOM analysis of GeneCloudOmics 
provides five plots: (1) the property plot, (2) the count 
plot, (3) the cluster plot, (4) the distance plot, and (5) the code plot 
(Fig. 15b-f). 

Fig. 15 The self-organizing map (SOM) analysis parameters and plots. (a) The SOM analysis parameters, (b) 
the property plot, (c) the count plot, (d) the cluster plot, (e) the distance plot, and (f) the code plot 

t-SNE is another dimensionality reduction approach that reduces 
the high-dimensional gene expression data to two or three dimen- 
sions. To perform the t-SNE analysis in GeneCloudOmics, normal- 
ize the data as described above, and then go to the “t-SNE” link in 
the “Transcriptome Analysis” menu. Then enter the perplexity 
value, the number of principal components (PCs), and the number 
of clusters. Next, choose if you want to log transform the data or 
not, and click “Plot” (Fig. 16a). The t-SNE plot and the t-SNE 
table are shown in the corresponding tabs (Fig. 16b, c). 


The DE analysis produces a list of DEGs associated with the difference 
in phenotype between samples (conditions or treatments). To 
understand the biology behind this difference, this list of genes 
needs to be annotated and interpreted. Most of the available DE 
analysis tools do not provide bioinformatics tools for gene list 
functional analysis and interpretation, or they provide only basic 
analysis such as GO annotation [28]. GeneCloudOmics provides access to 
11 different bioinformatics tools for the analysis of gene and protein 
lists (Fig. 2). The GeneCloudOmics bioinformatics section is 
designed to be used independently from the DE analysis and the 
biostatistical sections. Thus, a gene or protein list that results from 
any analysis can be analyzed using the GeneCloudOmics bioinformatics 
section. 

In this section of the tutorial, we will use the list of DEGs 
resulting from the above analysis and perform several functional 
annotations. For demonstration purposes, we will use the 50 most 
significant DEGs from the list. The genes on the list use 
Arabidopsis mRNA IDs, which are not supported by GeneCloudOmics. 
Therefore, we used the ID converter of g:Profiler to convert 
them to UniProt IDs. 


The GO association analysis retrieves and sorts the GO terms associated 
with the genes in the gene list (after ID conversion to 
UniProt protein IDs) and creates three GO plots. To perform 
GO association analysis, go to the “Protein Set Analysis” in the 
main menu, click “Gene Ontology,” and then upload the list of 
UniProt IDs as a CSV file or, alternatively, paste the list of IDs in the 
designated text box (Fig. 17a). The paste option supports different 
types of delimiters including space, tab, comma, and new line. 
Therefore, the IDs can be directly copied from spreadsheet software 
(e.g., Microsoft Excel or Google Spreadsheets) or other media such 
as text files. Once the upload/paste is complete, click the “Submit” 
button. GeneCloudOmics connects to UniProt, downloads the 
GO associations, and then creates the GO terms biological process 
plot (Fig. 17b), the GO terms molecular function plot (Fig. 17c), 
and the GO terms cellular component plot (Fig. 17d). The 
results can also be downloaded as CSV files for further analysis or 
to be imported to another tool. 
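
Handling of the supported delimiters can be sketched with a single regular expression. Removing duplicate IDs is an assumption added for this example, and the identifiers are placeholders; the actual parsing in GeneCloudOmics may differ.

```python
import re

def parse_ids(text):
    """Split pasted identifiers on spaces, tabs, commas, or new lines;
    blanks and repeated IDs are dropped (deduplication is an assumption)."""
    seen = []
    for token in re.split(r"[\s,]+", text.strip()):
        if token and token not in seen:
            seen.append(token)
    return seen

# Placeholder accessions pasted with mixed delimiters
pasted = "P12345, Q67890\tP12345\nA0A000 "
ids = parse_ids(pasted)
```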

Fig. 16 The t-distributed stochastic neighbor embedding (t-SNE) analysis 
parameters and outputs. (a) The t-SNE analysis parameters, (b) the t-SNE plot, and (c) the t-SNE table 

Fig. 17 The gene ontology (GO) association analysis parameters and outputs. (a) The GO association analysis 
parameters, (b) the biological processes plot, (c) the molecular functions plot, and (d) the cellular 

Pathway enrichment analysis identifies the pathways that are 
enriched in a given list of genes or proteins. This helps in understanding 
the cellular pathways and biological functions affected by the 
differential expression of those genes. GeneCloudOmics allows 
performing this analysis using gene names or protein UniProt 
IDs. To perform pathway enrichment analysis using gene names 
or protein IDs, go to the “Gene Set Analysis” or “Protein Set 
Analysis,” respectively, click “Pathway Enrichment,” and then 
upload or paste the list of gene names or UniProt IDs as described 
above. Then select the plot style and layout, and choose the minimum 
overlap between the query and the pathways (the minimum 
number of the query genes to be in the pathway) (Fig. 18a). The 
pathway enrichment is performed using g:Profiler [46], and the 
results are plotted as an enrichment plot (Fig. 18b) and can be 
downloaded in CSV format. 


GeneCloudOmics enables downloading and plotting all protein-protein 
interactions (PPIs) associated with a given set of proteins 
(UniProt IDs). The PPIs are visualized as a network using Cytoscape.JS 
[47] and can also be downloaded as an interaction table. To 
perform PPI analysis, go to the “Protein Set Analysis” in the main 
menu, click “Protein Interactions,” and then upload or paste the 
list of UniProt IDs as described above (Fig. 19a). GeneCloudOmics 
connects to UniProt and downloads the PPIs associated with each 
of the provided proteins and then plots the interactions as a net- 
work (Fig. 19b). The style of the network (node and edge appear- 
ance and colors) can be changed from the “Select Style” menu, and 
the network layout can be changed from the “Select Layout” menu 
(Fig. 19a). GeneCloudOmics provides five and ten different net- 
work styles and layouts, respectively. 


Protein functions and subcellular localizations can provide useful
information on the set of proteins under investigation.
GeneCloudOmics provides two tools to access this information from the
UniProt Knowledgebase. To get the functions or the subcellular
locations of your proteins, go to “Protein Set Analysis” in the
main menu, and click “Protein Functions” or “Subcellular
Localization,” respectively. Next, upload or paste the list of UniProt IDs as
described above. GeneCloudOmics connects to UniProt,
downloads the functional and localization annotations associated
with each of the provided proteins, and displays them in a tabular view
(Fig. 20). The results can also be downloaded in CSV format.


GeneCloudOmics also provides access to tools that investigate 
different properties of the proteins including protein physicochem- 
ical properties, sequence properties, and evolutionary properties 
(Helmy 2021). To investigate the physicochemical properties of 
your proteins, go to the “Protein Set Analysis” in the main menu, 
Fig. 18 The pathway enrichment analysis parameters and outputs. (a) The pathway enrichment analysis
parameters and (b) the pathway enrichment plot

Fig. 19 The protein-protein interaction (PPI) analysis parameters and outputs. (a) The PPI analysis input and (b)
the PPIs visualized as a network with the circle layout
Fig. 20 The protein functions and subcellular localization outputs. (a) The protein functions table and (b) the
subcellular localization table

click “Protein Properties,” then upload or paste the list of UniProt
IDs as described above, and click “Submit.” This feature performs
three different protein property analyses: (1) protein sequence
charge, (2) protein sequence acidity and GRAVY index, and
(3) protein hydrophobicity. Each of them can be accessed through the
corresponding tab. In addition, the “All physicochemical properties”
tab provides a combined analysis of all of them (Fig. 21a).

The evolutionary analysis provided by GeneCloudOmics
includes the protein’s gene tree, chromosomal location, and
phylogenetic tree. To access the protein evolutionary analysis tools,
go to “Protein Set Analysis” in the main menu, click
“Evolutionary Analysis,” then upload or paste the list of UniProt IDs as
described above, and click “Submit.” Each of these analyses can be
accessed through the corresponding tab. Here, we show the output
of the phylogenetic tree analysis, which performs multiple sequence
alignment (MSA) and then creates the phylogenetic tree (Fig. 21b).

In all gene and protein set analyses, you do not need to
upload your gene or protein list every time you use a new analysis. If
you intend to use the same list in multiple analyses, upload the list
through “Upload a Protein List” in the “Protein Set Analysis”
menu. The uploaded list will be kept until you finish your analysis
and close the session. When moving from one analysis to another,
your protein list will already be pasted in the text box and ready for
the next analysis.

3.4.7 Result Interpretation

In this protocol, we used transcriptomic data from wild-type and
engineered Arabidopsis cells, comprising ten samples from five
conditions with two replicates per condition [32]: wild-type cells that
are untreated (M) or treated with rapamycin (Rap), and engineered
cells that are untreated (M) or treated with rapamycin (Rap) or
brassinolide (BL). We demonstrated how GeneCloudOmics
can be used to perform transcriptomic data analysis for synthetic
biology research by preprocessing the data obtained from the GEO
database, performing several biostatistical tests, identifying the
differentially expressed genes (DEGs) between two different
conditions, and analyzing the list of DEGs using multiple
bioinformatics tools. For sample-level analysis, we chose two
samples: (1) the untreated wild type (CM1) and (2) the
rapamycin-treated engineered cells (RR1).

Firstly, the preprocessing step showed that the raw data had already
been partially preprocessed, as the box plots of the raw data and the
TPM-normalized data show similar profiles (Fig. 4b). However, the
violin plots of the same two datasets show fewer outliers and a better
distribution for the normalized data (Fig. 4c). To determine the
expression threshold for low-count filtering, we used
transcriptome-wide distribution fitting for each of the two selected
samples. The transcriptome-wide distribution fitting with six
different statistical distribution models indicates a cutoff of five, since
counts above this value follow all six statistical distributions
(Fig. 6b, c).

Fig. 21 The protein properties and evolutionary analysis outputs. (a) All the protein physicochemical property
plot and (b) the protein phylogenetic tree plot

Next, we used several biostatistical tests to investigate the
global relationship between samples, either all samples or pairwise.
The width of the scatter plot can be used to visualize the variation
between two samples and to show the number of toggle genes
[Sandro 2022]. For the two selected samples, the width of the
scatter plot shows variability in the expression of several genes,
and the genes on the two axes are the toggle genes (Fig. 5b). The
PCA variance plot shows PC1 and PC2 as the main components
(Fig. 7b). The 2D-PCA groups all the samples into two clusters
and shows considerable variation between replicates (Fig. 8b),
while the 3D-PCA shows less variation between replicates and
identifies the brassinolide-treated engineered cells (RB1) as the most
variable sample (Fig. 8c).

The Pearson correlation and Spearman correlation analyses
show high correlation between replicates of the same condition,
illustrated by the correlation heatmaps (Figs. 9c and 10c) and the
correlation plots (Figs. 9d and 10d). Furthermore, the wild-type
samples (treated and untreated) show a considerable correlation
with the untreated engineered cells but not with the treated
engineered cells. The noise analysis shows a noise level of around
0.3 for all samples, with a significant increase for the
rapamycin-treated engineered cells (RR1 and RR2) (Fig. 11b, c). While all the
samples have a Shannon entropy of 0.6 on average, the same two
samples of the rapamycin-treated engineered cells show an elevated
Shannon entropy of 0.1 and 0.14, respectively (Fig. 12b, c).
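
To make the two correlation measures concrete, the following Python sketch computes Pearson and Spearman coefficients for a pair of expression vectors. It is an illustration only and not the code used by GeneCloudOmics; the function names are hypothetical, and Spearman is computed as the Pearson correlation of average ranks.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranked values."""
    return pearson(average_ranks(x), average_ranks(y))
```

Pearson measures linear agreement between expression values, whereas Spearman depends only on their ordering, so a monotonic but nonlinear relationship between two samples gives a Spearman coefficient of 1 and a lower Pearson coefficient.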

The differential expression (DE) analysis between the two
selected samples was performed using the edgeR method with a
minimum twofold expression change threshold and an FDR of
0.05. The analysis resulted in 3325 DE genes. The results of the
DE analysis were visualized using the volcano plot, which shows
the relationship between the p-value and the expression fold difference
for every gene (Fig. 13b), and the estimation of gene-wise dispersion
and empirical shrinkage of these estimates to produce a more
accurate dispersion estimate for actual gene count modeling
(Fig. 13c). All the DEGs were then analyzed by the heatmap gene
clustering feature to identify groups of genes with similar patterns
of gene expression change between samples (Fig. 14b). The analysis
shows similar global expression patterns in the treated and
untreated wild-type cells and the untreated engineered cells (CM,
CR, and RM), while the rapamycin-treated engineered cells (RR1
and RR2) and the brassinolide-treated engineered cells (RB1 and
RB2) show two significantly different patterns (Fig. 14b).
Gene-wise, four common expression patterns were observed: (1) genes
with decreased expression in the CM, CR, and RM samples and
increased expression in the other two samples (RR and RB) (Group
1); (2) genes with decreased expression in the CM, CR, and RM
samples, increased expression in the RB samples, and highly
increased expression in the RR samples (Group 2); (3) genes with
decreased expression in the CM, CR, and RM samples, strongly
decreased expression in the RR samples, and increased expression
in the RB samples (Group 3); and (4) genes with increased
expression in the CM, CR, and RM samples, decreased
expression in the RB samples, and strongly decreased expression in the RR
samples (Group 4) (Fig. 14b).
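
The two selection criteria used above (a minimum twofold change and an FDR of 0.05) can be sketched in Python as follows. This is not edgeR and does not model counts or dispersion; it only shows how a fold-change threshold and an FDR cutoff combine once per-gene p-values are available. It assumes the FDR is a Benjamini-Hochberg adjusted p-value, and the function names and gene labels are hypothetical.

```python
from math import log2


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so monotonicity can be enforced.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(genes, min_fold=2.0, fdr=0.05):
    """Keep genes passing both thresholds.

    `genes` is a list of (gene_id, fold_change, p_value) tuples, where
    fold_change is a positive ratio between the two conditions.
    """
    qvalues = benjamini_hochberg([p for _, _, p in genes])
    cutoff = log2(min_fold)
    return [
        gene_id
        for (gene_id, fold, _), q in zip(genes, qvalues)
        if abs(log2(fold)) >= cutoff and q <= fdr
    ]


toy_genes = [("g1", 4.0, 0.01), ("g2", 1.5, 0.04),
             ("g3", 0.25, 0.03), ("g4", 2.0, 0.005)]
toy_degs = select_degs(toy_genes)
```

Working on the log2 scale treats up- and down-regulation symmetrically, so a fold change of 0.25 passes a twofold threshold just as a fold change of 4 does.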

The list of DEGs was then analyzed using the GeneCloudOmics
bioinformatics features. Since GeneCloudOmics supports gene
names and UniProt IDs only and the gene list was in Arabidopsis
transcript IDs, the ID converter of g:Profiler was used to convert
the Arabidopsis transcript IDs to UniProt IDs.

The gene ontology (GO) biological process analysis of the top
50 DEGs shows the enrichment of several brassinosteroid-related
processes and signaling pathways as well as several metabolic
processes related to the cellular response to different stimuli (Fig. 17b).
The GO molecular function analysis for the same genes shows
enrichment of functions related to metal ion binding and different
hormonal activities (Fig. 17c). The GO cellular compartment analysis
shows that most of the top 50 proteins are from the membrane,
extracellular, or cell wall areas (Fig. 17d). The protein-protein
interactions (PPI) feature in GeneCloudOmics was used to obtain
the PPIs associated with the same set of proteins. The analysis
shows that only a few of those proteins are associated with known
interactions (Fig. 19b). For pathway enrichment analysis, we used
the top 100 genes, which show enrichment of the brassinosteroid and
phenylpropanoid biosynthesis pathways and the biosynthesis of
secondary metabolites (Fig. 18b).

Investigating the proteins’ physicochemical properties shows
that, among the proteins corresponding to the top 50 DEGs,
negatively charged proteins outnumber positively charged ones.
In terms of acidity, they are balanced, with almost half of the
protein residues being acidic and half being basic (Fig. 21a). The
GRAVY index, which indicates the hydrophobicity of the proteins,
shows that 84% of the proteins have a negative score (a negative
GRAVY score indicates a hydrophilic protein) (Fig. 21a). The
evolutionary analysis feature in GeneCloudOmics is demonstrated
by the creation of a phylogenetic tree
for the top 50 proteins. GeneCloudOmics downloaded the
sequences, performed MSA, calculated the phylogenetic
relationships between the proteins, and then created and plotted the tree
(Fig. 21b).
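
For readers unfamiliar with the GRAVY index, the Python sketch below computes it as the mean Kyte-Doolittle hydropathy value over all residues of a sequence. Whether GeneCloudOmics computes the index in exactly this way is an assumption; the function name and the example peptides are hypothetical. On this scale, positive means indicate hydrophobic proteins and negative means indicate hydrophilic ones.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean hydropathy value per residue.

    Raises KeyError for non-standard residue codes, so ambiguous letters
    such as X or B must be handled before calling this function.
    """
    residues = sequence.upper()
    if not residues:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


def fraction_negative(sequences):
    """Share of sequences whose GRAVY score is below zero (hydrophilic)."""
    scores = [gravy(seq) for seq in sequences]
    return sum(score < 0 for score in scores) / len(scores)
```

A statement such as "84% of the proteins are negative" corresponds to `fraction_negative` returning 0.84 for the protein set.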
4 Notes


1. The metadata file is required to run any transcriptomic
   data analysis. It is not usually included with the data. It can be
   easily created in a spreadsheet or a text editor as a table with two
   columns, where the first column lists the sample names, exactly
   as in the read count file, while the second column lists the
   relationship between them (time points or treatments).


2. The entries in the second column of the metadata file must start
   with the lowercase letter “t.”


3. The gene length file is required for certain normalization
   methods (see above). This file lists the gene (or transcript) IDs and
   the length of each gene. This file needs to be created if it is not
   included in the source of the data.


4. To create the gene length file, download the genome annotation
   of the organism from the Ensembl database or other
   specialized databases. The genome annotation is usually in
   the gene-finding format or the generic feature format (GFF).
   The GFF format lists the genome annotation features in a
   tabular format of nine columns. Columns 1, 3, 4, and 5 are
   the important columns for creating the gene length file. Open the
   file using spreadsheet software such as Microsoft Excel and
   filter column 3 (the feature column), selecting the features
   annotated as “mRNA.” Then, add a column called “Length”
   whose contents equal column 5 minus column 4 (the
   mRNA end minus the mRNA start). Then copy column 1 (the
   sequence column) and the new “Length” column to a new
   sheet, and rename column 1 to “Gene.” You will have a table
   with two columns, “Gene” and “Length.”


5. GFF files are usually large, with thousands of rows. In
   many cases, the spreadsheet method will not work
   because the file is bigger than what Excel can handle.
   In such cases, a programming language such as R or
   Python is needed to process the GFF file and create
   the gene length file.


6. GeneCloudOmics is a webserver that performs all the analyses
   on demand, and it does not store any data. The data size and
   the query length play an important role in the time that the
   analysis will take, especially in the analyses that GeneCloudOmics
   performs by connecting to other sources, such as the
   UniProt Knowledgebase. Therefore, a very long gene list will
   initiate a very long query, which can take
   longer than the session expiry time. If you encounter such a
   situation, reduce the number of genes/proteins by
   running the analysis multiple times, each time with a subset of
   the genes/proteins.
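
Notes 4 and 5 can be sketched in Python for annotation files that are too large for a spreadsheet. The sketch below is a minimal illustration under several assumptions: the input is tab-delimited GFF3 with nine columns, the transcript identifier is read from the `ID` attribute in column 9 (falling back to column 1 when it is absent), and the output is a tab-delimited table with the headers “Gene” and “Length”; the exact layout expected by GeneCloudOmics should be checked against its documentation. Because GFF coordinates are 1-based and inclusive, the sketch uses end minus start plus 1; drop the `+ 1` to match Note 4 exactly. The function names are hypothetical.

```python
import csv


def gene_lengths_from_gff(lines, feature="mRNA"):
    """Map feature ID -> length for all rows whose type equals `feature`."""
    lengths = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature:
            continue
        start, end = int(cols[3]), int(cols[4])
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            item.split("=", 1) for item in cols[8].split(";") if "=" in item
        )
        feature_id = attrs.get("ID", cols[0])
        lengths[feature_id] = end - start + 1
    return lengths


def write_gene_length_table(lengths, handle):
    """Write a two-column table with the headers 'Gene' and 'Length'."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Gene", "Length"])
    for gene, length in lengths.items():
        writer.writerow([gene, length])
```

A typical use would be to open the annotation file with `open(path)`, pass the file object to `gene_lengths_from_gff`, and write the result to a new file with `write_gene_length_table`. Since the file is read line by line, memory use stays small even for large annotations.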


5 Conclusions 
Overview of Bioinformatics Software and Databases
for Metabolic Engineering


Deena M. A. Gendoo 


Abstract 


The explosion of the “omics” era has introduced a growing number of datasets and tools that facilitate
molecular interrogation of the metabolome. These include various bioinformatics and pharmacogenomics 
resources that can be utilized independently or collectively to facilitate metabolic engineering across disease, 
clinical oncology, and understanding of molecular changes across larger systems. This review provides 
starting points for accessing publicly available data and computational tools that support assessment of 
metabolic profiles and metabolic regulation, providing both a depth-and-breadth approach toward under- 
standing the metabolome. We focus in particular on pathway databases and tools, which provide in-depth 
analysis of metabolic pathways, which is at the heart of metabolic engineering. 


Key words Bioinformatics, Metabolomics, Pharmacogenomics, Software, Databases, Metabolic engi- 
neering, High throughput, Omics 


1 Introduction


One can view metabolic engineering as the integration and synergy 
of two main components, which include (1) investigating the meta- 
bolome, which focuses predominantly on understanding the role of 
substrates in influencing a biological (metabolic) process or path- 
way, and (2) engineering the metabolome, which proposes new 
strategies to optimize and target metabolic networks and cellular 
processes. The engineering aspect can encompass varied tactics, 
including reconstruction of metabolic networks, protein or nucleic 
acid engineering, and analysis and manipulation of metabolic fluxes 
[1]. The ultimate goal is the optimization of cellular processes and 
metabolic pathways, for the purpose of producing desired and cost- 
effective chemical compounds at optimal conditions within a spe- 
cific organism [2, 3]. Techniques for engineering the metabolome 
to produce biofuel, chemical, and pharmaceutical products are in 
continuous development and are discussed extensively elsewhere 
[1, 2, 4]. However, pathway design, construction, and optimization
efforts for engineering the metabolome are contingent on a
thorough understanding of metabolic pathways, genes and sub- 
strates involved, and networks and circuits that regulate the perfor- 
mance and flux of the metabolic network [1, 3]. This review focuses 
on the computational tools and repositories that support research- 
ers in “investigating the metabolome” toward successful engineer- 
ing efforts. 


2 A Modular View of the Metabolome 


The starting model and visualization of a metabolic pathway differ
from those of other high-throughput omics outputs, such as the
results generated by sequencing efforts (e.g., RNA-Seq, WGS, 
WES). Metabolic pathways can be visualized using a network or 
map-based approach [1]. The main elements (nodes) of the net- 
work include the metabolites, which are substrates or products of 
metabolism that are responsible for driving cellular functions 
[5, 6]. Other nodes include genes (or their protein products) that 
directly interact with the metabolites or which are indirectly 
affected by the metabolites. This interplay between metabolites 
and genes will also include an implicit directionality (edges), 
which signifies the upstream and downstream ends of the pathway, 
and the production of the metabolites and substrates at intermedi- 
ate steps of the process (Fig. 1). Understanding this directionality is 
relevant to metabolic engineering approaches that rely on native 
cellular pathways for synthesis of desired compounds, or which rely 
on nonnatural pathways that introduce new reactions and new 
chemistry as part of genome-scale metabolic models [3, 7]. 
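
The network view described above, with metabolites and genes as nodes and directionality as edges, can be sketched in Python as a directed graph. The node labels (M1, G1, and so on) and the function name are hypothetical and only loosely follow the M/G naming of Fig. 1; this is an illustration of the data structure, not of any particular pathway.

```python
from collections import deque


def downstream(edges, start):
    """Return all nodes reachable from `start`, in breadth-first order.

    `edges` is a list of (source, target) pairs; each pair encodes the
    upstream-to-downstream direction of one step in the pathway.
    """
    graph = {}
    for source, target in edges:
        graph.setdefault(source, []).append(target)

    visited = {start}
    reached = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                reached.append(nxt)
                queue.append(nxt)
    return reached


# Toy pathway: metabolite M1 is acted on by gene product G1, yielding M2
# and the by-product M4; M2 is then converted to M3 via G2.
toy_edges = [("M1", "G1"), ("G1", "M2"), ("M2", "G2"),
             ("G2", "M3"), ("G1", "M4")]
```

Querying `downstream(toy_edges, "M2")` lists only the nodes after M2, which is the kind of question that arises when deciding where in a native pathway an engineered reaction should be introduced.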

Given this model of a metabolic pathway, investigating the
metabolome can take the form of different modes of inquiry. One
line of inquiry involves a thorough investigation of a given pathway
and its subcomponents (metabolites and genes) and understanding
the downstream effect of that pathway on possible genotypic or
phenotypic changes in the cell, the organism, and the metabolic
system. Another line of inquiry entails investigating the links
between metabolic data and other data outputs, such as
high-throughput sequencing, pharmacogenomics, proteomics, or
other omics datasets. The goal is to relate current knowledge
garnered for the metabolic pathway to other genomic or “omics”
level data, which will provide a comprehensive system-wide view of
the metabolic network. In addition to a more detailed view of one
particular pathway, this can also enable comparison of multiple
pathways in tandem.
Fig. 1 The starting model of a metabolic network. Various metabolites (M) either are released as by-products
of a pathway (P) or contribute and interact with genes (G) toward the production of more metabolites
downstream. Genes (G) or their protein products are connected by a directionality that indicates the upstream
and downstream flow of the pathway


3 Metabolic Pathway Databases and Tools 


The advent of sequencing technologies over the past two decades
has produced a large number of metabolic datasets, as well as
chemogenomic datasets that contain information about drug
compounds and substrates. We highlight several
key examples that are frequently used by researchers in the field.


3.1 Pathway Databases


There is a growing list of databases that provide genome-scale 
network information to analyze, visualize, and manage metabolic 
pathways. Popular and highly referenced repositories include 
KEGG [8] and MetaCyc [9] (Table 1), owing to their extensive 
collection of curated metabolic networks that users can query and 
visualize. Some of these databases have appealing features from an 
informatics perspective: 


• Large-scale datasets can be accessed and downloaded
  programmatically (e.g., using R and Bioconductor). This avoids
  reliance solely on web-based interfaces, and data can be parsed
  and efficiently integrated as part of larger and more complex
  computational pipelines.


• The pathway datasets include pathways that span multiple
  organisms or include biochemical pathways that are independent
  of any particular organism. This facilitates meta-analytical
  comparisons of pathway behavior where needed.


KEGG and MetaCyc describe interactions between enzymes
and substrates using reaction maps. KEGG [8] is a reference
knowledge base that integrates several databases: the PATHWAY database,
the GENES/SSDB/KO database, and the COMPOUND/REACTION
database. These databases are represented as graph objects that
contain information about pathways and complexes, genes and
proteins, and biochemical compounds and reactions, respectively.
The KEGG PATHWAY database in particular contains a collection
of manually drawn networks that represent metabolic pathways; the
metabolic networks are viewed as networks of enzymes and
provided as reference pathways that are not necessarily unique to
any particular organism [1, 8]. This modular format provides
flexibility and allows parsing of the metabolic networks using a variety
of computational tools. MetaCyc [9] is utilized for pathway
prediction as part of the BioCyc dataset [10]. In contrast to KEGG,
MetaCyc contains organism-specific metabolic network diagrams,
with taxonomic information stored as part of their pathway
annotation [11]. MetaCyc can be accessed from pathway and genome
databases (PGDBs) such as BioCyc [10]. Enzyme and reaction
information in MetaCyc can be accessed using the Pathway Tools
software, a cross-platform program that powers both MetaCyc and
BioCyc.


Table 1
Overview of genome-scale databases for metabolic networks

Database        Computational access and highlights

KEGG PATHWAY    • Website (https://www.genome.jp/kegg/pathway.html)
                • Pathways can be downloaded as GMT files as part of MSigDB
                • Data is stored in the form of graph objects that reflect upon
                  proteins, chemical compounds, and genes
                • Parsed using R and Bioconductor packages

BioCarta        • Pathways can be downloaded as GMT files as part of MSigDB

Reactome        • Website access
                • Parsed using R and Bioconductor packages
                • Pathways can be downloaded as GMT files as part of MSigDB
                • The developer’s portal (https://reactome.org/dev) provides
                  pathway widgets that users can incorporate into their web
                  applications and an analysis service (including API) where
                  users can analyze their own data against the Reactome
                  database or access the Reactome content as interconnected
                  graphs

MetaCyc         • Website access
                • R and Bioconductor packages
                • GMT files as part of MSigDB

MSigDB          • GMT files can be downloaded for canonical pathways (CP),
                  containing genesets from the KEGG, BioCarta, PID, Reactome,
                  and WikiPathways databases


3.1.1 Informatics Access of Metabolic Networks: An In-Depth Example

Rendering of the metabolic networks as reaction maps and graph 
objects provides ample flexibility and versatility in terms of how 
these pathways are accessed, visualized, and integrated into bioin- 
formatics and engineering analysis pipelines. As an example, we 
focus on some of the computational access options provided for 
KEGG (Table 1). 
Fig. 2, left panel (identifier scheme)
Prefix: map = manual reference pathway; org = organism-specific pathway
Identifier start: 011 = global map; 012 = overview map; 010 = chemical structure map; 07 = drug structure map
Integration with MODULE and NETWORK databases in KEGG: M = module; R = reaction module; N = network

Fig. 2, right panel (1. Metabolism; 1.0 Global and overview maps)
01100 M   Metabolic pathways
01110 M   Biosynthesis of secondary metabolites
01120 M   Microbial metabolism in diverse environments
01200 MR  Carbon metabolism
01210 MR  2-Oxocarboxylic acid metabolism
01212 MR  Fatty acid metabolism
01230 MR  Biosynthesis of amino acids
01250 MR  Biosynthesis of nucleotide sugars
01240 MR  Biosynthesis of cofactors
01220 MR  Degradation of aromatic compounds

Fig. 2 Rendering of pathway maps using KEGG. Pathway maps are labeled with a five-digit number and a two- 
to four-letter prefix code (left panel). This rendering facilitates easy access to pathways that are global or 
organism-specific and the extraction of global, chemical, and drug structure maps. A representative snapshot 
of identifiers for global metabolic pathways is provided (right panel) 
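
The identifier scheme in Fig. 2 lends itself to a small parsing routine. The Python sketch below splits a pathway identifier into its prefix and five-digit number and labels the map class from the leading digits, using the codes shown in the figure (011, 012, 010, 07). The function name and the returned field names are hypothetical, and treating every prefix other than “map” as non-reference is a simplification made for illustration.

```python
import re

# Optional "path:" namespace, a two- to four-letter prefix, five digits.
KEGG_PATHWAY_ID = re.compile(r"^(?:path:)?([a-z]{2,4})(\d{5})$")

# Leading digits and the map class they denote, as listed in Fig. 2.
MAP_CLASSES = [
    ("011", "global map"),
    ("012", "overview map"),
    ("010", "chemical structure map"),
    ("07", "drug structure map"),
]


def parse_pathway_id(identifier):
    """Split a KEGG pathway identifier into prefix, number, and map class."""
    match = KEGG_PATHWAY_ID.match(identifier)
    if match is None:
        raise ValueError(f"not a KEGG pathway identifier: {identifier!r}")
    prefix, number = match.groups()
    map_class = None
    for start, label in MAP_CLASSES:
        if number.startswith(start):
            map_class = label
            break
    return {
        "prefix": prefix,
        "number": number,
        "is_reference": prefix == "map",  # "map" = manual reference pathway
        "map_class": map_class,           # None for regular pathway maps
    }
```

For example, `parse_pathway_id("map01100")` reports a reference pathway of the class “global map,” while an identifier with an organism code as prefix is reported as organism-specific.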


Computerized access to the KEGG resources is possible 
through the KEGG API (application programming interface) and 
is provided for academic use. 

Using the API, we show a quick example of how to download a 
pathway map for the metabolic pathway related to “terpenoid 
backbone biosynthesis.” Our lab has recently focused on this meta- 
bolic pathway as a starting point to identify synergistic drug com- 
binations for targeting breast cancer using integrative 
pharmacogenomics datasets [12]. These simple steps provide a 
quick starting point to access the desired pathway: 


1. The list of all metabolic pathways under KEGG PATHWAY
   can be found at
   https://www.genome.jp/kegg/pathway.html#metabolism.

   Pathway maps are labeled using a five-digit number and a
   prefix code (two to four letters) (Fig. 2).


2. Using the REST API, one can search for all pathways pertaining
   to “terpenoid” using the “find” option of the URL:
   http://rest.kegg.jp/find/<database>/<query>

   A full description of the <database> and <query> parameters
   is provided under the KEGG API descriptions online:
   https://www.kegg.jp/kegg/rest/keggapi.html.

   In our example, the <database> refers to “pathway,” and
   the <query> refers to any instance containing “terpenoid.” As
   such, the search would be conducted as:
   http://rest.kegg.jp/find/pathway/terpenoid


3. Our search indicates that terpenoid backbone biosynthesis is
   listed as path:map00900.

   Accordingly, to select this map, use the “map” prefix
   along with the five-digit number of the designated pathway,
   “00900,” to access the KEGG entry for this pathway directly:
   https://www.kegg.jp/entry/map00900
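
The steps above can be sketched in Python using only the standard library. The URL pattern is the one given in step 2; the assumption that the service returns tab-delimited text with one entry per line, as well as the function names, are ours and should be checked against the KEGG API description. The network call is kept in a separate function so that the URL construction and the parsing can be tried without an internet connection.

```python
from urllib.parse import quote
from urllib.request import urlopen

KEGG_REST = "http://rest.kegg.jp"  # base URL as given in the steps above


def find_url(database, query):
    """Build the 'find' URL: <base>/find/<database>/<query>."""
    return f"{KEGG_REST}/find/{quote(database)}/{quote(query)}"


def parse_find_response(text):
    """Turn the response text into (entry, description) pairs.

    Assumes one entry per line with a tab between the identifier and
    its description.
    """
    rows = []
    for line in text.splitlines():
        if line.strip():
            entry, _, description = line.partition("\t")
            rows.append((entry, description))
    return rows


def kegg_find(database, query):
    """Run the search against the live service (requires network access)."""
    with urlopen(find_url(database, query)) as response:
        return parse_find_response(response.read().decode("utf-8"))
```

Calling `kegg_find("pathway", "terpenoid")` would correspond to step 2, and the entry `path:map00900` from step 3 is expected among the returned pairs.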
3.1.2 Analysis of Pathway Enrichment


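# Install the KEGGREST package from Bioconductor (requires BiocManager)
BiocManager::install("KEGGREST")
library(KEGGREST)

# Search KEGG PATHWAY for entries matching "terpenoid"
keggFind("pathway", "terpenoid")
# Retrieve the entry for terpenoid backbone biosynthesis
QueryPathway <- keggGet("path:map00900")


Fig. 3 Sample R commands for installation of the KEGGREST package in R and access to the “terpenoid
backbone biosynthesis” pathway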


Notably, KEGG has been readily incorporated in the Biocon- 
ductor project <https://www.bioconductor.org>, which contains 
a compendium of software, data, and annotation packages. The 
KEGGREST package provides quick access to the KEGG REST 
API using R and Bioconductor and includes utilities to search 
identifiers and link with other databases. Following the example 
above, the same entry can be accessed using several R commands 
which install the KEGGREST package into R, query pathway 
information in KEGG, and extract the object containing the desig- 
nated pathway (Fig. 3). 

There are several Bioconductor or CRAN packages that also 
allow users to access KEGG in a variety of formats. These include 
packages for both parsing KEGG and other compound databases, 
pathway visualization or network-based analyses of metabolites, 
and pathway enrichment analyses [13]. KEGG pathway maps are 
encoded using the KEGG Markup Language (KGML). KGML 
contains specifications of the graph objects in KEGG, which allows 
users to manipulate or reconstruct the KEGG pathway [8]. The 
KEGGgraph [14] and Pathview [15] packages enable the parsing 
and loading of KGML encoded data for every pathway, which can 
be supplied as a ‘KEGG Pathway’ object for manipulating the graph 
object within R [16]. 

Using KEGGgraph, users can parse pathways that are rendered 
under KEGG PATHWAY, including protein and chemical net- 
works (see Fig. 4 for a representative image of a network). For a 
protein network under KEGG PATHWAY, this consists of gene 
products connected by “relations” (edges). For a chemical net- 
work, connectivity between chemical compounds is illustrated by 
“reactions” [14]. Metabolic networks can be viewed as both pro- 
tein and chemical networks, which encapsulate the network of 
proteins (enzymes) and chemical compounds involved [14]. 
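
To show what the “relations” and “reactions” of a KGML file look like once parsed, the following Python sketch reads a small, made-up KGML fragment with the standard library XML parser. It is not a substitute for KEGGgraph. The fragment is a toy example written for illustration (the pathway, gene, and compound names are hypothetical), and the element and attribute names used (`entry`, `relation`, `reaction`, `substrate`, `product`) reflect our understanding of the KGML layout, which should be verified against the KGML specification.

```python
import xml.etree.ElementTree as ET

# Toy fragment written for illustration; not a real KEGG pathway.
TOY_KGML = """<pathway name="path:toy00000" number="00000" title="Toy pathway">
  <entry id="1" name="gene:A" type="gene"/>
  <entry id="2" name="gene:B" type="gene"/>
  <entry id="3" name="cpd:X" type="compound"/>
  <entry id="4" name="cpd:Y" type="compound"/>
  <relation entry1="1" entry2="2" type="ECrel"/>
  <reaction id="1" name="rn:R0" type="irreversible">
    <substrate id="3" name="cpd:X"/>
    <product id="4" name="cpd:Y"/>
  </reaction>
</pathway>"""


def parse_kgml(xml_text):
    """Return (relations, reactions) from a KGML document.

    relations: list of (source name, target name) pairs between entries.
    reactions: list of (reaction name, substrates, products).
    """
    root = ET.fromstring(xml_text)
    names = {entry.get("id"): entry.get("name") for entry in root.iter("entry")}

    relations = [
        (names[rel.get("entry1")], names[rel.get("entry2")])
        for rel in root.iter("relation")
    ]
    reactions = []
    for rxn in root.iter("reaction"):
        substrates = [s.get("name") for s in rxn.findall("substrate")]
        products = [p.get("name") for p in rxn.findall("product")]
        reactions.append((rxn.get("name"), substrates, products))
    return relations, reactions


toy_relations, toy_reactions = parse_kgml(TOY_KGML)
```

The relations give the protein network (gene products connected by edges), and the reactions give the chemical network (compounds connected through a reaction), matching the two views described above.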


In addition to informatics access of metabolic networks as reaction 
maps, the classical approach to pathway analysis includes assessment 
of pathway enrichment. This is a statistical calculation that informs 
whether a given set of genes in a sample is enriched for a known
pathway, in this case, a metabolic pathway [16-18]. The resulting
enrichment scores indicate whether a metabolic
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Fig. 4 Representative example of a metabolic KEGG pathway for nitrogen metabolism. The pathway can be 
directly accessed from <https:/www.genome.jp/kegg-bin/show_pathway?ko000910>. Part of the pathway is 
shown, with genes represented in purple boxes and metabolites as small circles 


3.2 Drug Compound 
Databases 


pathway is significantly regulated in the system being studied 
[19, 20]. Many of the metabolic pathway repositories aforemen- 
tioned (including KEGG, Reactome, MetaCyc) have individual 
pathways stored as genesets (Table 1) within MSigDB 
[20, 21]. GMT files containing these genesets can be easily ported 
as part of several computational pipelines for overrepresentation 
analyses [18] by single-sample GSEA [19] or GSEA [20]. Other 
advanced algorithms and tools also include taking this further by 
including topology-based pathway enrichment, such as SPIA [22]. 


There are a number of repositories that host information about 
drug compounds, chemical substrates, bioactive molecules, as well 
as behavior of these molecules (such as the mechanism of action) 
where available. These repositories can be mined to learn about 
metabolites and by-products that are involved within metabolic 
pathways of interest, and therefore present a source of complemen- 
tary and necessary information alongside metabolic pathway data- 
bases. Accordingly, these datasets also play a significant role for 
“chemogenomics” and chemoinformatics studies that are implicit 
to metabolic engineering. Chemogenomics is an inclusive term that 
involves the screening of all possible chemical compounds against 
the universe of potential targets (proteins and drug targets) 
[23]. We highlight two of the largest of these datasets below. 

ChEMBL: ChEMBL is a large, manually curated drug discovery 
database that hosts information about bioactive molecules 
[24, 25]. The database is hosted by the European Bioinformatics 
Institute (EBI), which is part of the European Molecular Biology 
Laboratory (EMBL). Articles across several medicinal chemistry 
journals are mined to extract new bioactivity data for small 
molecules or peptides, which are stored as part of the 
database [24]. This includes curated links between 2D chemical 
structures and designated targets, alongside other information 
about the drug properties (logP, molecular weight, Lipinski 
parameters) [26]. Additionally, bioactivity data and screening results 
from other databases (such as PubChem BioAssay) are also incorporated 
[24]. Collectively, this facilitates a variety of investigations and 
applications, which include analyzing selectivity and off-target 
effects of drugs, identifying suitable drugs for a designated target, 
and investigating bioactivity information that was collated from 
existing experiments [24, 26, 27]. ChEMBL can be accessed at 
<https://www.ebi.ac.uk/chembl/>. The latest release (ChEMBL 
29) spans over two million compounds with associated collated 
information from over one million assays. 
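
As a small illustration of how such drug properties are used, the Python sketch below checks a compound against Lipinski's rule of five (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors). The function name and the two example property sets are hypothetical and are not taken from ChEMBL.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Return the list of rule-of-five criteria that a compound violates."""
    rules = {
        "molecular weight <= 500 Da": mol_weight <= 500,
        "logP <= 5": logp <= 5,
        "H-bond donors <= 5": h_donors <= 5,
        "H-bond acceptors <= 10": h_acceptors <= 10,
    }
    return [name for name, passed in rules.items() if not passed]


# Hypothetical property values, for illustration only.
small_compound = lipinski_violations(350.0, 2.1, 2, 5)
large_compound = lipinski_violations(720.0, 6.3, 2, 5)
print(small_compound)
print(large_compound)
```

In practice the four properties would be read from a compound record rather than typed in by hand.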

PubChem: PubChem was developed in 2004 as a public reposi- 
tory hosted by the National Center for Biotechnology Information 
(NCBI), part of the National Institutes of Health (NIH) [26, 28, 
29]. The repository contains three component databases: Sub- 
stance, Compound, and BioAssay. The Substance database contains 
depositor-provided chemical data, such as data provided by aca- 
demic laboratories, pharmaceutical companies, or governmental 
research institutes [26, 28]. PubChem Compound stores internally 
reviewed chemical information that is extracted from the Substance 
database. The BioAssay database contains bioactivity screening 
studies of small molecules [26, 29]. PubChem is one of the most 
visited chemistry websites in the world, owing to the sheer volume 
of data collated from varied data sources and its growing compen- 
dium, including the very recent addition of chemical information 
from 100 new data sources [29]. 


4 Linking Metabolomic Data with High-Throughput Omics Profiles 


A growing range of sequencing technologies now facilitates the 
collection of large, high-throughput datasets that can be used not 
only to study disease [30, 31] but also to integrate several of these 
data types with metabolite data. One prominent example of this 
merger of metabolomic profiling with other “omics” datasets is 
provided by the L1000 and CMAP datasets [32]. These datasets 
contain genotypic information pertaining to drug-treated cancer 
cell lines, allowing users to quantify gene expression changes 
that occur due to treatment with drugs and experimental compounds 
[31]. As part of this growing effort, the KEGGlincs package 
<https://www.bioconductor.org/packages/devel/bioc/html/KEGGlincs.html>, 
for example, allows users to load KEGG PATHWAY files in R, alongside 
key information pertaining to the behavior of genes within the 
pathway, based on knockdown experiments. This provides 
complementary genotypic information that can be merged with 
metabolite datasets to provide a more comprehensive picture of 
metabolic behavior. 
5 Conclusions and Future Directions 


Computational Simulation of Tumor-Induced Angiogenesis 


Masahiro Sugimoto 


Abstract 


Cancer cells require higher oxygen levels and nutrition than normal cells. Cancer cells induce angiogenesis 
(the development of new blood vessels) from preexisting vessels. This biological process depends on the 
spatial, chemical, and physical properties of the microenvironment surrounding tumor tissues. The 
complexity of these properties hinders an understanding of their mechanisms. Various mathematical models 
have been developed to describe quantitative relationships related to angiogenesis. We developed a 
three-dimensional mathematical model that incorporates angiogenesis and tumor growth. We examined 
angiopoietin, which regulates the sprouting and branching events in angiogenesis. The simulation successfully 
reproduced the transient decrease in new vessels during vascular network formation. This chapter describes 
the protocol used to perform the simulations. 


Key words Tumor, Cancer, Angiogenesis, Systems biology, Simulation 


1 Introduction 


The microenvironment surrounding tumors is one of the key 
factors that determine the fate of tumor growth. Tumors are 
exposed to low oxygen levels (hypoxia), leading to high metabolic 
stress [1]. To stably obtain oxygen and nutrition, tumor angiogenesis 
factors (TAFs) [2], such as vascular endothelial growth factor 
(VEGF) [3], are secreted from the tumor to surrounding regions to 
stimulate existing blood vessels to induce the development of new 
blood vessels (angiogenesis) [4]. Spatial conditions, such as the 
location of preexisting blood vessels, the distance between these 
vessels and tumors, the extracellular matrix, and the distribution of 
secreted TAFs, shape the vascular networks [5]. Chemical reactions 
and changes in the physical properties caused by the tumor 
and distorted space also contribute to the network formation. 
Bevacizumab (Avastin) is used to prevent this angiogenesis 
and inhibit the supply of molecules that would be used for 
tumor growth [6-8]. However, the prediction of treatment efficacy 
is still difficult because angiogenesis depends on 
multidisciplinary features. Therefore, understanding the relationships 
within this biological process is important but difficult. 

Various mathematical simulation models have been developed 
to understand tumor-induced angiogenesis [9, 10]. The available 
models are classified into three types: angiogenesis [11-15], tumor 
growth [16-19], and integration of both [20-25]. Conventional 
models are implemented in two-dimensional spaces because of the 
high computational cost of three-dimensional simulation. Recently, 
models implemented in three-dimensional space have become 
available [26-28]. In addition, recent models incorporate a wider 
range of chemical and physical properties, whereas conventional 
models considered only a few factors. Both of these improvements 
should contribute to more reproducible and realistic simulations of 
angiogenesis. 

Tang et al. developed a model in three-dimensional space to 
reproduce angiogenesis and tumor growth considering the chemical, 
physical, and spatial properties of the tumor microenvironment 
[29]. This model implemented the sprouting of a new blood vessel 
depending on the TAF concentration. However, the sprouting 
mechanism is more complex and depends on the angiopoietin 
family [30]. Angiopoietin is expressed in vascular endothelial cells 
and regulates the adhesion between vascular wall cells and 
endothelial cells. Angiopoietin-1 (Ang-1) promotes endothelial-parietal 
cell adhesion and vascular maturation by binding to the receptor 
tyrosine kinase Tie-2. Ang-2 is an antagonist of the Tie-2 receptor, 
which weakens cell-cell adhesion. Therefore, the balance between 
Ang-1 and Ang-2 controls the stabilization and remodeling of 
blood vessels and capillary sprouting [31]. 

We modified the model of Tang et al. to implement Ang-1 and 
Ang-2 as the regulatory functions of vascular flexibility 
[32]. During the angiogenesis process, a temporary regression of 
blood vessel growth has been observed in vivo [33]. Our model 
successfully reproduced this phenomenon in simulation. 


2 Methods 


2.1 Protocols 


The physical and chemical processes involved in tumor growth and 
angiogenesis are described here. The overall concept of a tumor, 
including preexisting blood vessels, new blood vessels, and 
distributed molecules, is depicted in Fig. 1. The distribution of all 
molecules is described by partial differential equations. For each 
time step, the pressure gradient and the distribution of each factor, 
such as oxygen, are calculated, and the status of new blood vessel 
formation and tumor growth is updated. An overview of each 
step is provided below. 
Fig. 1 The overall concept of tumor-induced angiogenesis. The top red area is the preexisting vessel and the 
center circle is a tumor. The preexisting vessel and tumor distribute various elements, such as VEGF and CO2. 
Angiopoietins (Ang-1 and Ang-2) contribute to the sprouting and branching of new blood vessels. Figure 
labels: preexisting vessel; O2, Ang-1, and Ang-2 secretion at TEC; sprouting; branching; new blood vessel; 
VEGF gradient; secretion of VEGF and CO2; tumor 


Fig. 2 Initialization of the simulation space. Each axis (x, y, and z) is discretized into 200 grids, the initial 
cancer cells (n = 5) are placed at x = 100, y = 100, z = 100, and the initial preexisting blood vessel is defined 


2.1.1 Initialization 


1. Initialize the simulation space and discretize the space (e.g., 
200 × 200 × 200 grid space). 


2. Place five cancer cells at the center of the computational 
domain. 


3. Place the preexisting blood vessels. The preexisting blood vessels 
should be placed at a distance from the center of the cancer cells 
(Fig. 2). 


4. Calculation: Each of the 33 simulation steps is calculated. The 
33 steps comprise 1 day. For each step, the pressure gradient 
and factor distribution are calculated, and the new blood vessel 
and tumor cells are subsequently updated. The dependencies of 
the variables are shown in Fig. 3. 
Fig. 3 Overall processing and parameter dependencies at each step of the simulation. The process proceeds from 
tissue-level calculations to cell-level ones, which include pressure, factor distribution, angiogenesis, cell 
phenotype, and tumor growth. For example, in the pressure calculation, (1) CTP, (2) VTP, and (3) interstitial 
fluid velocity are calculated. Subsequently, distributions of various factors (e.g., O2 and CO2) are calculated. 
TAF, Ang-1, and Ang-2 contribute to the vessel formation and these processes form a loop. O2 and CO2 
contribute to the cell phenotype and tumor growth. CTP and VTP indicate cell-induced tumor pressure and 
vascular perfusion-induced tumor pressure, respectively 


5. Pressure: The tumor microenvironment pressure is calculated 
by combining cell-induced tumor pressure (CTP) and vascular 
perfusion-induced tumor pressure (VTP). The CTP at a spatial 
point is calculated as the sum of the pressures caused by the 
surrounding N tumor cells. VTP is calculated as the sum of the 
pressures caused by the k vascular endothelial cells present next 
to the point x_0, using a method similar to that used for CTP. 
Pressure (P) is calculated by adding CTP and VTP at each grid 
point. 


6. Oxygen (O2): O2 diffusion is calculated. O2 is assumed to be a 
nutrient necessary for tumor growth. The four processes 
involved are O2 diffusion, convection, O2 secretion from 
blood cells, and O2 consumption by tumor cells. The 
spatiotemporal evolution of the O2 concentration ($n$) is calculated as: 


\[ \frac{\partial n}{\partial t} = D_n \nabla^2 n - \nabla \cdot (u n) + \rho_n(r_s, (p_v - p)) \Sigma_V - \lambda_n(A_i) \Omega_T, \qquad (1) \]


where $n$ is the O2 concentration, $D_n$ is the diffusion constant of O2, 
$u$ is the pressure gradient, $A_i$ is the cellular activity of tumor cells, 
$\lambda_n$ is the removal term, and $\Sigma_V$ and $\Omega_T$ indicate the activity of all 
vascular cells and all tumor cells, respectively. 
The third term is the kinetics of O2 secretion from blood cells, 
calculated as: 


\[ \rho_n(r_s, (p_v - p)) = \rho_{n0} R_i W, \qquad (2) \]


where $\rho_{n0}$ is the O2 supply rate, $R_i$ is the radius of the vessel, and 
$W$ is the pressure gradient in the arterial wall. 
The fourth term is O2 consumption, whose rate is calculated as: 


\[ \lambda_n(A_i) = \lambda_{n0} A_i, \qquad (3) \]
Ai=sT4 exp ( 5(w i), (4) 


where $\lambda_{n0}$ is the O2 consumption rate and $w$ is the concentration of 
carbon dioxide (CO2). 


(a) Cell activity: Calculating the cell activity determines the 
activity status of each tumor cell. Cell activity is calculated 
according to the nutrient acquisition status estimated from the O2 
concentration. The transition of cells between the active and 
quiescent states is reversible. Based on Eq. 4, the cell status is 
determined as: 

\[ A_i > 0.5 \ \text{(active)}, \qquad A_i < 0.5 \ \text{(quiescent)} \qquad (5) \]


(b) Cell vital energy (CVE): CVE is the energy stored in the cell 
for proliferation. This concept explains tumor cell progression 
toward proliferation, as well as cell life and death. The time 
derivative of CVE ($V$) is calculated from the cell activity, with 
positive kinetics in the active state and negative kinetics in the 
quiescent state, as follows: 


\[ \frac{\partial V}{\partial t} = \frac{A_i}{A_i + 1} k_1 \ \text{(active)}, \qquad \frac{\partial V}{\partial t} = -k_2 \ \text{(quiescent)} \qquad (6) \]


Based on the cell activity and CVE, the tumor cell status is 
determined to be active, necrotic, or quiescent (Fig. 4). 


Fig. 4 The change of tumor status is based on cell activity and cell vital energy 
(CVE). Cell activity_th and CVE_th indicate predefined thresholds of these 
parameters. Necrotic, quiescent, and active are the possible statuses of a cancer cell 
(c) Tumor angiogenesis factors (TAFs): The distribution of TAFs 
in the simulation space is determined. Among the various types of 
TAF molecules, only VEGF is considered. The concentration 
of VEGF ($c$) is calculated from diffusion, convection, secretion 
by tumor cells, and removal from the vessels as follows: 

\[ \frac{\partial c}{\partial t} = D_c \nabla^2 c - \nabla \cdot (u c) + \rho_c(n) \Omega_T - \lambda_c(r_s) \Sigma_{TEC}, \qquad (7) \]

where $D_c$ is the VEGF diffusion constant and $\Sigma_{TEC}$ is the 
occurrence of the tip endothelial cells (TEC). The rates of VEGF 
secretion and consumption are assumed to be proportional to O2 
concentration and vessel radius, respectively. 


(d) CO2: The distribution of CO2 in the simulation space is 
calculated. The calculation of CO2 kinetics includes diffusion, 
convection, secretion by tumor cells, and removal by blood cells, 
like that used for VEGF. 


(e) Ang-1 and Ang-2: Angiopoietin is a capillary sprouting 
angiogenesis factor. The angiopoietin family regulates the adhesion 
levels between vessel wall cells and endothelial cells by binding 
to the Tie-2 receptor-type tyrosine kinase. Tie-2 is expressed 
in vascular endothelial cells and regulates angiogenesis. Ang-1 
stimulates Tie-2 on vascular endothelial cells during angiogenesis. 
Ang-2 inhibits Ang-1 binding to Tie-2. Therefore, the 
balance between Ang-1 and Ang-2 governs the vascular state. 
The distributions of Ang-1 and Ang-2 are calculated using 
diffusion, convection, secretion, and elimination terms, as for 
VEGF and O2. The secretion rate of angiopoietin is 
assumed to increase with the density of endothelial cells or 
TECs. The consumption rate is determined based on the 
concentration of angiopoietin. 


7. Sprouting and regression: The formation of blood vessels is 
updated. TAF promotes the sprouting of new blood vessels. In 
addition, capillary sprouting and branching are determined by 
the distance of each cell from the tumor and the balance of 
angiopoietin concentrations. Figure 5 shows the relationships 
among the described factors. 


8. Stop: The simulation is stopped after 60 days. 


2.2 Implementation 
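
As a rough illustration of the cell activity and CVE rules in the protocol (steps a and b), the Python sketch below updates the status of a single tumor cell over one time step. It is not the authors' MATLAB code. It assumes that the CVE rate is k1·A/(A + 1) in the active state and −k2 in the quiescent state, and that a cell becomes necrotic once its CVE drops to a threshold; the rate constants and the necrosis threshold are placeholder values.

```python
STEPS_PER_DAY = 33             # from the protocol: 33 steps comprise 1 day
SIMULATED_DAYS = 60            # from the protocol: stop after 60 days
DT = 1.0 / STEPS_PER_DAY       # time step in days
TOTAL_STEPS = STEPS_PER_DAY * SIMULATED_DAYS


def update_cell(activity, cve, dt, k1=1.0, k2=0.5,
                activity_threshold=0.5, necrosis_threshold=0.0):
    """Advance the cell vital energy (CVE) of one cell by one step.

    k1, k2, and necrosis_threshold are placeholder assumptions.
    Returns the new status and the updated CVE.
    """
    if activity > activity_threshold:
        # Active cell: CVE increases with the cell activity.
        cve += dt * k1 * activity / (activity + 1.0)
        status = "active"
    else:
        # Quiescent cell: CVE decreases at a constant rate.
        cve -= dt * k2
        status = "quiescent"
    if cve <= necrosis_threshold:
        # Assumed rule: a cell with exhausted CVE becomes necrotic.
        status = "necrotic"
    return status, cve


# One quiescent cell followed for the whole simulated period.
status, cve = "quiescent", 1.0
for _ in range(TOTAL_STEPS):
    status, cve = update_cell(0.2, cve, DT)
    if status == "necrotic":
        break
print(status)
```

Keeping the thresholds as parameters mirrors the predefined thresholds of Fig. 4 without fixing their values.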


All simulations described in this manuscript were performed using 
MATLAB R2019b (MathWorks, Natick, MA, USA) software. The 
simulation environment was a computer with an Intel Xeon E3-1230 
V2 CPU (3.30 GHz) and 20 GB of RAM. Each simulation run 
takes approximately 2 days. The source code (available upon 
request) is run in MATLAB. 
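
The factor distributions in the protocol (O2, VEGF, CO2, Ang-1, and Ang-2) all share diffusion, secretion, and removal terms. The Python sketch below shows one explicit finite-difference update of such a field on a small cubic grid. It is a simplified illustration rather than the authors' MATLAB implementation: the convection term is omitted, boundary values are held fixed, and the grid size and constants are placeholders.

```python
def diffusion_step(field, diffusion, dt, h, source=None, decay=0.0):
    """One explicit Euler update of dc/dt = D * laplacian(c) + s - decay * c.

    field is a cubic nested list indexed as field[x][y][z]. Only
    interior points are updated; boundary values stay unchanged.
    The explicit scheme is stable for dt <= h**2 / (6 * D).
    """
    n = len(field)
    new = [[[field[x][y][z] for z in range(n)] for y in range(n)] for x in range(n)]
    for x in range(1, n - 1):
        for y in range(1, n - 1):
            for z in range(1, n - 1):
                c = field[x][y][z]
                # Seven-point stencil for the 3D Laplacian.
                laplacian = (
                    field[x + 1][y][z] + field[x - 1][y][z]
                    + field[x][y + 1][z] + field[x][y - 1][z]
                    + field[x][y][z + 1] + field[x][y][z - 1]
                    - 6.0 * c
                ) / (h * h)
                s = source[x][y][z] if source is not None else 0.0
                new[x][y][z] = c + dt * (diffusion * laplacian + s - decay * c)
    return new


# Placeholder example: a unit pulse at the center of a 5 x 5 x 5 grid.
N = 5
grid = [[[0.0 for _ in range(N)] for _ in range(N)] for _ in range(N)]
grid[2][2][2] = 1.0
updated = diffusion_step(grid, diffusion=1.0, dt=0.1, h=1.0)
total = sum(value for plane in updated for row in plane for value in row)
print(updated[2][2][2], updated[3][2][2], total)
```

A 200 × 200 × 200 grid in pure Python would be far too slow; a practical version would use array operations, which may be one reason for the MATLAB implementation.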
Fig. 5 Flowchart of new blood vessel growth. VEGF_th and Days_th indicate the thresholds of these parameters 

Fig. 6 Snapshot of the spatial distribution of preexisting and new blood vessels and tumors from day 0 (a) to 
60 (j) 


2.3 Simulated Results 


The simulation was run for 60 days. Figure 6 shows the 
preexisting blood vessels (red curves), newly developed blood 
vessels (blue curves), and tumors (brown circles). Preexisting blood 
vessels were defined as the initial condition (Fig. 6a). Initially, the 
tumor grew without new blood vessel development (Fig. 6b). New 
blood vessels subsequently developed, and several reached the tumor 
(Fig. 6c-f). The tumor size increased. The new blood vessels were 
remodeled and showed a temporary regression (Fig. 6g). Finally, 
new blood vessels developed again, and the tumor grew rapidly 
(Fig. 6h-j). 

The distribution of VEGF at the cross-section (y = 100) is 
shown in Fig. 7. VEGF was not observed in the initial stage 
(Fig. 7a). The VEGF concentration gradually increased, and the 
distributed area also expanded (Fig. 7a-f). Subsequently, the 
concentration and distributed area became more stable (Fig. 7g-j). 

Fig. 7 Temporal change of distribution of VEGF at the cross-section (y = 100) from day 0 (a) to 60 (j) 
Computational Methods and Deep Learning for Elucidating 
Protein Interaction Networks 


Dhvani Sandip Vora, Yogesh Kalakoti, and Durai Sundar 


Abstract 


Protein interactions play a critical role in all biological processes, but experimental identification of protein 
interactions is a time- and resource-intensive process. The advances in next-generation sequencing and 
multi-omics technologies have greatly benefited large-scale predictions of protein interactions using 
machine learning methods. A wide range of tools have been developed to predict protein-protein, pro- 
tein-nucleic acid, and protein-drug interactions. Here, we discuss the applications, methods, and challenges 
faced when employing the various prediction methods. We also briefly describe ways to overcome the 
challenges and prospective future developments in the field of protein interaction biology. 


Key words Deep learning, Machine learning, Interaction, PPI, Protein networks, Neural networks 


1. Introduction 


The discovery of the DNA structure in 1953 prompted multiple 
studies into macromolecules and their effects on the various prop- 
erties of life. Roles of other types of biomolecules in the cell were 
established — RNA as intermediates and proteins as the effector 
molecules. However, further investigations revealed other complex 
types and functions of these macromolecules. Since the start of the 
twenty-first century, with the development of various sequencing 
programs and platforms, multiple databases to store biological data 
have been established. With a surge in biological data, bioinformat- 
ics has become an essential component of natural sciences. To 
understand the various biological processes, it is also necessary to 
identify the components, their roles, and their relationships. 
Biological systems are a complex web of interactions — gene 
transcription, metabolic signaling, and protein-protein interac- 
tions, to name a few layers that stack up to form an organism. 
Associated with the philosopher Descartes, the reductionist approach 
asserts that a complex system or situation can be better analyzed by 
reducing it to a sum of simpler parts. Similarly, in biology, higher 
levels in a hierarchy can be understood by studying the individual 
components. In the current post-genomics era, as more omics data 
are generated, there is a dearth of appropriate representations to 
describe biological networks. Individual members of a biological 
network, like genes, proteins, or drugs, can be represented as 
nodes, while the edges represent the various physical, biochemical, 
or functional interactions between the nodes. Whole cells and 
organisms can be represented by biological networks (collectives 
of features and behaviors) and can be systematically studied to 
predict associations between molecules, genes, diseases, and drugs 
and their targets. However, since the nature of biological data is 
complex and dynamic, the analysis requires multidisciplinary 
approaches. 


1.1 Protein Interaction Identification Methods 


Protein interactions play a vital role in the formation of structures 
and enzymatic regulation in a cell, maintaining homeostasis in the 
organism. Predicting the functions and interactions of proteins is 
among the most crucial pursuits in biology, yet most bioinformatics 
solutions are template-based algorithms. Although various bioin- 
formatics approaches are available, they are limited by the accuracy 
of the prediction model, and hence, experimental methods are still 
considered more reliable [1, 2]. 

X-ray crystallography is a preferred method for determining 
full-atom coordinates of a protein complex; however, it is costly 
and time-intensive [3]. Moreover, not all proteins or protein com- 
plexes get crystallized easily. Although the other experimental tech- 
niques do not provide atomic-level information, they are more 
popular because of their reliability over crystallography. Interaction 
detection approaches can be classified into several types 
(Table 1). Depending on the organism and the goal, various 
experimental techniques are available to detect and identify protein 
binding events [4-6]. Each technique has advantages but requires 
specific instruments and extensive knowledge for result analysis. As 
the field is developing, new and improved methods to reliably 
predict interactions are emerging. Yet, with the upsurge in available 
omics data in the recent past, in silico methods will need to play an 
increasing part in determining the protein interactions at the 
atomic scale. 

The predictive methods to model protein complexes can be 
categorized into (i) homology (template-based) modeling and (ii) ab 
initio or template-free docking. Template-based predictions depend 
heavily on the presence of similar structures reported in the 
literature or protein databases. Since template-based methods are 
limited by the number of quaternary structures available, ab initio 
methods are gaining traction with the increasing number of 
macromolecule sequences available. Protein domains and chains are 
known to be dynamic and undergo multiple conformational changes, 
reducing the accuracy of the computationally intensive docking 
approach. Numerous other techniques have also been proposed for 
structure-based interaction prediction, summarized in Table 2. 


Table 1 
Different experimental techniques for detecting protein interactions 

Interaction detection method type   Name                                               Reference 
Biochemical                         Affinity technology                                [175] 
                                    Aggregation assay                                  [176] 
                                    Chromatography technology                          [175] 
                                    Cosedimentation                                    [175] 
                                    Cross-linking study                                [175] 
                                    Electrophoretic mobility-based method              [177] 
                                    Enzymatic study                                    [175] 
                                    Probe interaction assay                            [175] 
Biophysical                         Biosensor                                          [178] 
                                    Circular dichroism                                 [179] 
                                    Mass spectrometry                                  [177] 
                                    Equilibrium dialysis                               [180] 
                                    Filter trap assay                                  [181] 
                                    Fluorescence technology                            [176] 
                                    Infrared spectroscopy                              [182] 
                                    Intermolecular force                               [183] 
                                    Isothermal titration calorimetry                   [184] 
                                    Light scattering                                   [185] 
                                    Neutron fiber diffraction                          [186] 
                                    Nuclear magnetic resonance                         [187] 
                                    Scintillation proximity assay                      [188] 
                                    Small angle neutron scattering                     [179] 
                                    Ultraviolet-visible spectroscopy                   [189] 
                                    X-ray crystallography                              Is] 
Genetic                             Chemical RNA modification plus base                [190] 
                                    Random spore analysis                              [175] 
                                    Synthetic genetic analysis                         [175] 
Imaging techniques                  Atomic force microscopy                            [191] 
                                    Confocal microscopy                                [175] 
                                    Electron microscopy                                [192] 
                                    Fluorescence microscopy                            [175] 
                                    Light microscopy                                   [175] 
                                    Super-resolution microscopy                        [175] 
                                    X-ray tomography                                   [175] 
Phenotype-based                     Nuclear translocation assay                        [193] 
Posttranscriptional                 Antisense RNA                                      [194] 
                                    RNA interference                                   [195] 
Protein complementation             Adenylate cyclase complementation                  [196] 
                                    β-galactosidase complementation                    [197] 
                                    β-lactamase complementation                        [198] 
                                    Bimolecular fluorescence complementation           [199] 
                                    Mammalian protein-protein interaction trap         [200] 
                                    Protein kinase A complementation                   [201] 
                                    Reverse ras recruitment system                     [202] 
                                    Split luciferase complementation                   [203] 
                                    Tox-R dimerization assay                           [204] 
                                    Transcriptional complementation assay              [175] 

An overview of some of the commonly used experimental techniques to detect and elucidate protein interactions, 
grouped into categories based on the methods followed 


Table 2 
Structure-based methods for modeling and predicting protein interactions 

Name          Approach                                                                        Reference 
ClusPro       Evaluates presumed complexes and retains a few promising complexes, scored      [205] 
              based on electrostatic and free energies 
GRAMM-X       Global search is performed by using FFT, and Lennard-Jones potential is         [206] 
              implemented on a fine grid to determine best surface match 
HexServer     Implements a closed-form spherical polar FFT correlation expression             [207] 
LZerD         Predictions are generated by using 3DZD, a mathematical protein surface         [208] 
              representation method 
Multi-LZerD   Predictions from LZerD are combined using a genetic algorithm and scored        [209] 
              using several methods 
PatchDock     The surface of the two molecules is segmented into geometric patches. The       [210] 
              patches containing interacting residues are filtered, and pose clustering 
              techniques are applied 
RosettaDock   Monte Carlo-based docking algorithm                                             [211] 
ZDOCK         A 3D FFT search of degrees of freedom between two proteins is carried out       [212] 
              and scored using statistical potential 

Some of the software and servers available for predicting protein interactions and scoring possible conformations 

While multiple methods are available for predicting protein 
interactions, quality assessment of the predictions for ranking and 
elimination of unlikely complexes and poses is crucial. Docking 
programs generally perform an energy-based scoring of the com- 
plexes to determine relevant structures. However, the scores 
assigned are relative and cannot be compared across platforms. 
Consensus clustering is another method that helps determine qual- 
ity of predicted poses, by clustering structures of similar scores 
together. The score could either be the root-mean-square deviation 
(RMSD) or the template modeling (TM) score. 
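
To illustrate RMSD-based pose comparison, the Python sketch below computes the RMSD between two coordinate sets and groups poses with a simple greedy clustering. It is an assumption-laden illustration rather than the procedure of any particular docking program: the poses are assumed to be already superimposed with atoms in matching order, the clustering rule is a stand-in for consensus clustering, and the coordinates and cutoff are placeholders.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))


def cluster_poses(poses, cutoff):
    """Greedy clustering: a pose joins the first cluster whose
    representative (first member) lies within the RMSD cutoff."""
    clusters = []
    for index, pose in enumerate(poses):
        for cluster in clusters:
            if rmsd(pose, poses[cluster[0]]) <= cutoff:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters


def shift_z(coords, dz):
    """Translate a pose along z (used only to build the toy example)."""
    return [(x, y, z + dz) for x, y, z in coords]


# Placeholder poses: a reference, a near copy, and a distant copy.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
poses = [reference, shift_z(reference, 0.5), shift_z(reference, 5.0)]
clusters = cluster_poses(poses, cutoff=2.0)
print(clusters)
```

Large clusters of mutually similar poses are usually taken as more reliable than isolated poses.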

Another class of methods to predict protein interaction 
networks, based on network topology, is not reliant on new 
biological data. The topology of the known interaction networks 
is utilized to predict missing links based on the triadic closure 
principle [7]. Similar to social network analyses, protein pairs are 
given a higher score when interaction partners are shared. 
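
The shared-partner idea can be sketched in a few lines of Python. The example below scores every pair of proteins that is not yet linked by the number of interaction partners they share (a common-neighbors score). The function name and the toy network are hypothetical, and published methods use more elaborate scores.

```python
from itertools import combinations


def shared_partner_scores(interactions):
    """Score unlinked protein pairs by their number of shared partners."""
    neighbors = {}
    for a, b in interactions:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    scores = {}
    for a, b in combinations(sorted(neighbors), 2):
        if b in neighbors[a]:
            continue  # already a known interaction
        shared = len(neighbors[a] & neighbors[b])
        if shared:
            scores[(a, b)] = shared
    return scores


# Toy network with placeholder protein names.
known = [("A", "B"), ("A", "C"), ("D", "B"), ("D", "C"), ("D", "E")]
scores = shared_partner_scores(known)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, score)
```

Pairs with the highest scores, here A-D and B-C, would be the first candidates for a missing interaction.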

The progress made to overcome the formidable challenge of 
laying down a comprehensive map of an organism’s interactome, 
especially that of complex eukaryotic beings, has been slow but 
steady. The entire human interactome is estimated to comprise 
more than 100,000 binary protein interactions [8], only around 
half of which have been identified so far through the Human 
Reference Protein Interactome (HuRI) project [9]. Interactomes 
of various organisms have been studied, but most have been 
incomplete [10-12]. 

However, such experimental and in silico methods cannot 
completely identify the effects of physicochemical factors or track 
the transient dynamics of the complex. Moreover, the differences in 
binding affinities because of loops or disordered regions, 
posttranslational modifications, or the influence of physiological 
factors are difficult to predict. Hence, there is a need for accurate 
prediction models that can effectively identify even transient 
interactions, expanding the coverage of interactions predicted while 
filtering out the false-negative and false-positive hits. Moreover, 
the relevance and statistical significance of the predicted 
interactions and interaction networks need to be determined. 
Computational approaches have proved beneficial in extrapolating 
from experimental data and may help determine the complete 
interactome of organisms. 


1.2 Machine Learning 


Analysis of big data derived from biological sources and subsequent 
prediction of related features have been made possible by advances 
in machine learning (ML). ML algorithms have been implemented 
for the prediction of protein interactions, based on both sequence 
and structure of proteins [13-15]. The input to these predictors is 
observable quantities, analyzed to make statistical predictions. In 
the case of protein interactions, these input “features” are the 
sequence, secondary structure, motifs, domains, genomic features 
such as gene context, and phylogeny; more recently, network 
topology-derived features are used as well. 

ML algorithms can also be classified into glass-box and black-box 
models, depending on whether knowledge of the transformation 
of input to output is available or not. Algorithms such as 
decision trees, random forests, and support vector machines 
(SVMs) allow the generation of explanations underlying the 
prediction mechanisms, while artificial neural networks, called 
black-box models, do not allow such explanations. Examples of such 
algorithms and their advantages in protein interaction prediction 
will be discussed later in the chapter. 


1.2.1 Evaluation of Machine Learning Models 
The predictions of most algorithms, in this case, are binary (positive 
or negative), i.e., presence or absence of interactions. For each 
prediction, either the class is predicted correctly (true positive or 
true negative) or the prediction is incorrect (false positive or false 
negative). Hence, to quantify the performance of the algorithms, 
multiple threshold-dependent measures are employed, as summarized 
in Table 3. Depending on the objective and the dataset available, 
one metric may be considered more important than the others, and 
the algorithm is then developed to improve that score. 
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Table 3 


Prediction model metrics 


Metrics Expression Definition 
Accuracy (TN + TP)/ Fraction of correct predictions 
(IN + TP + FP + FN) 
Precision TP/(TP + FP) Positive predictive rate 
Recall/ TP/(TP + EN) Fraction of correctly predicted positive samples 
sensitivity 
Specificity TN/(IN + EP) Fraction of correctly predicted negative samples 
Fl score (2*Precision*Recall) Harmonic mean of precision and recall 
(Precision+Recall) 
AUC = Area under the curve (true vs. false positive rate OR 


precision vs. recall) 


A few parameters to measure algorithm prediction performance have been listed. The term TN stands for true negative, 
TP for true positive, FP for false positive, and FN for false negative 
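
The metric definitions above can be illustrated with a minimal Python sketch. The function name `classification_metrics` and the example counts are hypothetical and chosen only for illustration; returning 0.0 when a denominator is zero is an assumption of this sketch, not a convention stated in the chapter.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute threshold-dependent metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    # Each ratio falls back to 0.0 when its denominator is zero (assumption).
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for a PPI classifier evaluated on 100 protein pairs.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Which of these values to optimize depends on the application; for example, a screen intended to shortlist candidates for costly experiments may favor precision over recall.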


The aim of developing ML models is to gain the ability to
correctly predict novel interacting partners given a limited dataset,
i.e., the ML model should be able to generalize well on new data.
The generalizability of the model depends on the input dataset as
well as the complexity of the prediction algorithm. An overly complex
model would fit the training data well but fall short on new
samples, while a model that is too simple would not even fit the
training data well. Both these extremes, termed overfitting and
underfitting, respectively, are evaluated while measuring model
performance. Testing the prediction performance on an independent
test set is required to determine the robustness of the predictor.
In the case of neural networks, learning on the training dataset
is generally followed by evaluation on a validation dataset to reduce
errors and then by testing on an independent dataset to measure
robustness.

1.3 Protein Interaction Databases

Thorough scrutiny of published literature and the increasing
reliability of computational predictions have allowed the creation of
databases of protein interactions. These databases serve as important
pools of information to build template-based models and
machine learning-based predictive models. Protein interaction
information obtained from various experimental and computational
methods is compiled in various online resources. These
databases are generally classified into two categories:

1.3.1 Primary Databases

Collected and curated manually, the protein interaction information
available in primary databases is derived from small- or large-scale
experimental procedures (Table 4).

Table 4
Databases for protein interaction studies and predictions

Database  Type       Description                                                                               Reference
BioGRID   Primary    Biological General Repository for Interaction: PPI database for multiple model organisms  [213]
DIP       Secondary  Database of Interacting Proteins: curated PPI database for multiple organisms             [214]
HINT      Secondary  High-quality INTeractomes: curated PPI database for multiple organisms                    [215]
HPRD      Primary    Human Protein Reference Database: database of PPIs from high-throughput experiments       [216, 217]
STRING    Secondary  Tool for obtaining functional enriched PPI networks for multiple model organisms          [218]
mentha    Secondary  Public archive of PPI data                                                                [219]
HIPPIE    Secondary  Human Integrated PPI rEference: tool to generate human PPI networks                       [220]
HuRI      Primary    Database of human binary protein interactions                                             [9]
MINT      Primary    Database of PPIs based on literature                                                      [221]

Commonly used databases of protein interactions, with the type indicated: primary databases are derived and curated
from experiments, while secondary databases also include predictions


1.3.2 Secondary Databases

The protein interactions derived from experiments as well as
high-confidence predictions from computational approaches are
compiled in databases termed secondary.

Since carrying out experiments for multiple types of proteins 
across organisms is challenging and limited by time, expertise, and 
cost, computational approaches based on protein interactions 
reported in the two categories of databases are used to develop 
prediction algorithms. Moreover, Table 5 compiles multiple 
resources that consolidate the already available PPI data in a user- 
friendly interface. 


Table 5
Available tools and databases for PPIs

Resource name                                Description                                                                                                            URL
Human Integrated PPI Reference (HIPPIE)      Web tool to generate human PPI networks, integrating protein interaction networks with experiment-based quality scores  http://cbdm-01.zdv.uni-mainz.de/~mschaefer/hippie
Molecular INTeraction (MINT)                 Database of PPIs for multiple model organisms                                                                          https://mint.bio.uniroma2.it/
Human Protein Reference Database (HPRD)      Database of human PPIs from high-throughput experiments                                                                www.hprd.org
Dobson & Doig (D & D)                        Benchmark dataset of 1178 protein structures                                                                           https://graphlearning.io
Protein Interaction Network Analysis (PINA)  Database of PPIs for multiple model organisms                                                                          https://omics.bjcancer.org/pina
STRING                                       Database of PPIs and tool for obtaining functional enriched PPI networks for multiple model organisms                  https://string-db.org

2 Methods

2.1 Feature Extraction

Machine learning-based predictors of protein interactions are
trained on a set of feature vectors that attempt to capture important
information about the proteins and the complexes. ML algorithms can
be adapted to discriminate between interacting and non-interacting
protein pairs based on specific factors that differ between a pair
that interacts and a pair that does not. A crucial step that
determines the performance of the ML model is feature extraction and
representation. Retaining all essential information intrinsic to the
protein and its interacting partner remains a hurdle in identifying
and predicting novel protein interactions. Various types of features
have been used in the recent past to describe proteins and predict
interactions (Fig. 1). The major categories are discussed as follows.

2.1.1 Sequence Features


Fig. 1 Encoding schemes for proteins. (a) A one-hot-based encoding scheme that allows protein representation
in the form of the amino acid sequence. (b) An encoding scheme that allows incorporating evolutionary
information. (c) Graph-based encoding that allows retaining sequential as well as spatial information

The primary structure of the protein is the linear sequence of amino
acids that form the building blocks [16]. The sequence of the
protein determines its structure, and hence utilizing the information
encoded in the sequence has been a preferred approach for both
experimental and computational studies [17]. Several encoding schemes
have been reported to represent protein sequences in a
machine-readable format, from conventional one-hot encoding and k-mer
encoding to more advanced schemes based on amino acid properties. For
example, the 20 amino acids can be clustered into 7 classes based on
side-chain size and charge: (AVG), (ILFP), (YMTS), (HNQW), (RK), (DE),
and (C). Each feature is then a triad of classes representing three
consecutive residues, giving 7*7*7 = 343 possible triads, and each
protein is described by a frequency vector recording the number of
times each triad occurs in its sequence [18]. An improved version of
this method involves clustering the amino acids into six categories
based on their biochemical properties, (IVLM), (FYW), (HKR), (DE),
(QNTP), and (ACGS), hence redefining the relative frequencies of the
amino acid triads. However, these representations generate many
zero-valued elements in the feature vectors. Since zero-valued
elements remain zero even after scaling and normalization, not much
information is captured, negatively affecting the performance of the
predictive algorithm. Hence, counting dimer residues from
position-specific scoring matrices was also adopted [19]. Another
approach grouped the amino acids into four classes, depending on the
chemical properties of the side chains: (GAVLIMP), (STCNQ), (KRHED),
and (FYW) [20]. In this scheme, termed the RFAT system, proteins are
represented as a 128-dimensional vector with fewer zero elements.
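
The triad-based encoding described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of ref. [18]: the seven-class grouping used here places isoleucine with L, F, and P so that all 20 residues are covered, the function names `triad_counts` and `normalize` are hypothetical, and min-max scaling is only one possible normalization.

```python
from itertools import product

# Seven residue classes (assumed grouping; isoleucine is placed with L, F, P).
CLASSES = ["AVG", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: i for i, group in enumerate(CLASSES) for aa in group}

# All 7*7*7 = 343 class triads, in lexicographic order.
TRIADS = list(product(range(len(CLASSES)), repeat=3))
TRIAD_INDEX = {triad: i for i, triad in enumerate(TRIADS)}


def triad_counts(sequence):
    """Count how often each class triad occurs among consecutive residues."""
    # A KeyError is raised for non-standard residue letters.
    classes = [AA_TO_CLASS[aa] for aa in sequence.upper()]
    counts = [0] * len(TRIADS)
    for i in range(len(classes) - 2):
        counts[TRIAD_INDEX[tuple(classes[i:i + 3])]] += 1
    return counts


def normalize(counts):
    """Min-max scale the counts to [0, 1]; zero counts stay zero."""
    lo, hi = min(counts), max(counts)
    if hi == lo:
        return [0.0] * len(counts)
    return [(c - lo) / (hi - lo) for c in counts]


# Hypothetical short peptide used only as an example.
vector = normalize(triad_counts("MKVLAAGIVGRKDE"))
print(len(vector), sum(1 for v in vector if v > 0))
```

For short sequences most of the 343 entries stay at zero, which illustrates the sparsity problem mentioned in the text.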

More recent studies employ neural networks for the extraction 
of global and local sequence features that may be significant in 
interaction prediction. A recent study employs a Siamese recurrent 
convolutional network to capture the influence of protein 
sequences [21]. Stacked autoencoders allow capturing useful infor- 
mation from input data and reconstructing an output, generating 
robust features from protein sequence descriptors [22]. Multichan- 
nel input vectors are also reported to represent the different cate- 
gories of protein features, consisting of information from the 
protein-encoding matrix, the substitution scoring matrices, the 
physicochemical property matrix, and the residue contact energy 
matrix [23]. Some commonly used protein feature extraction
methods are summarized in Table 6.

Table 6
Sequence-based protein feature extraction methods

Name                                      Description                                                                                                              Descriptors
Amino acid composition                    Frequency of each amino acid type in a protein or peptide sequence                                                       20
Composition of k-spaced amino acid pairs  Frequency of amino acid pairs separated by any k residues                                                                Variable
Tripeptide composition                    The number of tripeptides represented by amino acid types r, s, and t                                                    8000
Dipeptide composition                     The number of dipeptides represented by amino acid types r and s                                                         400
Dipeptide deviation from expected mean    Calculated using dipeptide composition (Dc), theoretical mean (Tm), and theoretical variance (Tv)                        Variable
Grouped amino acid composition            Based on classes of amino acids according to their physicochemical properties                                            Variable
Binary                                    One-hot encoding                                                                                                         20 x n
Moran correlation                         Based on the distribution of amino acid properties along the sequence                                                    Variable
Geary correlation                         Determine if adjacent observations of the same phenomenon are correlated                                                 Variable
Normalized Moreau-Broto autocorrelation   Autocorrelation of a topological structure                                                                               21 x n
Composition/transition/distribution       Amino acid distribution patterns of a specific structural or physicochemical property in a protein or peptide sequence   13
Conjoint triad                            Properties of one amino acid and its vicinal amino acids by regarding any three continuous amino acids as a single unit  Variable
Sequence-order-coupling number            Based on distance matrix describing a distance between the two amino acids                                               Variable
Pseudo-amino acid composition             Based on hydrophobicity values, hydrophilicity values and the side-chain mass of the 20 natural amino acids              Variable
AAindex                                   Based on physicochemical properties of amino acids                                                                       544


2.1.2 Evolutionary Features

Comparison of a specific protein with similar sequences from a
reference database allows compiling an alignment, which indicates the
probability of occurrence of each amino acid at each position. Since
there are 20 canonical amino acids, for each protein of length L, a
position-specific scoring matrix (PSSM) of dimensions L*20 can
be constructed [24]. Transforming protein sequences to PSSMs
allows information from homologous sequences, which reflects the
evolutionary past, to be included. Hence, PSSM-based features
incorporate not only sequence but also evolutionary features.
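
To make the L*20 layout concrete, the following Python sketch derives a simplified scoring matrix from a small alignment. It is an illustration only: PSSMs used in practice are generated by dedicated alignment tools, whereas this sketch assumes a uniform background frequency, a fixed pseudocount, and a hypothetical function name `simple_pssm`.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def simple_pssm(alignment, pseudocount=1.0):
    """Return an L x 20 matrix of log2-odds scores from aligned sequences.

    Assumptions of this sketch: uniform background frequencies (1/20),
    additive pseudocounts, and gap or unknown symbols simply ignored.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        column = [seq[pos] for seq in alignment if seq[pos] in AMINO_ACIDS]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        row = []
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            row.append(math.log2(freq / background))
        pssm.append(row)
    return pssm


# Hypothetical three-sequence alignment of length 3.
matrix = simple_pssm(["ACD", "ACE", "ACD"])
print(len(matrix), len(matrix[0]))
```

Residues that are conserved in a column receive positive scores and rarely observed residues receive negative scores, which is how evolutionary conservation enters the feature vector.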


PSSMs created with PSI-BLAST at an e-value threshold of 0.001 were
used to develop a featurization scheme termed pseudo-PSSM (PsePSSM),
which allowed predicting membrane interactions of proteins [25]. In
another approach, the L*20 matrix is considered as 20 blocks, and the
resulting Block-PSSM features are converted to a 1*400 feature
vector, shown to improve the prediction of protein function [26].
Implementing PSSMs for the prediction of protein interactions offers
two benefits: no special annotation is needed that would be biased
toward a specific subset of the proteomics data, and, more
importantly, evolutionary information vital to protein interactions
is encoded. Generating bigram features from PSSMs, coupled with the
features derived from pseudo-amino acid composition for proteins,
allowed better prediction of drug-target interactions [27]. Coupling
more protein-specific and context information also improves the
performance of human PPI prediction algorithms. An example can be
found in a study that combines features derived from
posttranslational modification information, codon usage, tissue
information, and gene ontology. Different classifiers were then
trained and shown to be reliable in predicting PPIs in humans [28].
Other similar studies, based solely on PSSMs as well as incorporating
other features, have been reported to predict protein interactions in
various species [29-31].

2.1.3 Domain-Based Features

The binding specificity of a protein is determined by the structural
features in the binding pocket of the domain. Domains are compact
three-dimensional structures formed by conserved stretches of protein
sequences. Domains are capable of existing and functioning
independently of the rest of the protein, and a protein may contain
multiple domains. It has been shown that predictions based only on
sequence features fall short on new data and that this limitation
may be addressed by including domain information. An earlier
report showed that the included domain information had a high
predictive value [32]. Prediction of host-pathogen interactions was
carried out by developing a novel framework that integrated publicly
available intraspecies protein interaction information with the
corresponding domain profiles [33]. The frequency of interaction of
specific domain pairs was calculated from the dataset and then used
to estimate the probability of interaction of novel protein domain
pairs. Another approach involved the integration of protein domain
information along with sequence features and other protein properties
to predict virus-host protein-protein interactions. Implementing
linear kernel SVMs, the predictor fared well when trained on a
combination of multiple types of features [34].
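
The domain-pair frequency idea can be sketched as follows. This is a hedged illustration rather than the framework of ref. [33]: the protein and domain identifiers are invented, the function names are hypothetical, and combining domain-pair frequencies with a noisy-OR rule is an assumption made here for simplicity.

```python
from collections import Counter
from itertools import product


def domain_pairs(protein_a, protein_b, domains):
    """Return the set of unordered domain pairs spanned by two proteins."""
    return {tuple(sorted(pair))
            for pair in product(domains[protein_a], domains[protein_b])}


def domain_pair_frequencies(labelled_pairs, domains):
    """Fraction of observed protein pairs that interact, per domain pair."""
    seen, positive = Counter(), Counter()
    for a, b, interacts in labelled_pairs:
        for pair in domain_pairs(a, b, domains):
            seen[pair] += 1
            if interacts:
                positive[pair] += 1
    return {pair: positive[pair] / seen[pair] for pair in seen}


def interaction_score(a, b, domains, freq):
    """Noisy-OR combination (an assumption): 1 - prod(1 - f) over pairs."""
    p_none = 1.0
    for pair in domain_pairs(a, b, domains):
        p_none *= 1.0 - freq.get(pair, 0.0)
    return 1.0 - p_none


# Hypothetical domain annotations and labelled training pairs.
DOMAINS = {
    "P1": {"SH3"}, "P2": {"PRM"}, "P3": {"SH3", "PDZ"},
    "P4": {"PRM"}, "P5": {"KIN"}, "P6": {"SH3"}, "P7": {"PRM"},
}
TRAINING = [
    ("P1", "P2", True), ("P3", "P4", True),
    ("P1", "P4", False), ("P1", "P5", False),
]

FREQ = domain_pair_frequencies(TRAINING, DOMAINS)
print(interaction_score("P6", "P7", DOMAINS, FREQ))
```

Domain pairs never seen in the training data contribute a frequency of zero in this sketch, so proteins with entirely novel domain combinations receive a score of zero; a practical predictor would need a prior or smoothing for such cases.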

Other methods that use domain structure information include an
empirical force field to calculate energy functions for human domain
interactions [35] and the construction of position-weighted matrices
(PWMs) of all possible SH3 protein-ligand complexes using homology
modeling [36]. An SVM-based predictor trained on domain structure and
sequence information of various PDZ interactions was also tested on
multiple organisms for scanning the proteome [37].

2.1.4 Motif-Based Features

The three-dimensional organization of conserved protein
sequences is called a motif, which, unlike domains, cannot retain
structure and function independently of the protein [38]. Short
linear motifs (LMs or SLiMs) have been shown to play a crucial role
as mediators in protein interactions [39, 40]. SLiMs are generally
two to eight amino acid residues in length and can directly interact
with protein structures in the same or other proteins. Motif features
and motif-motif interaction features may be exploited for the
prediction of protein interactions. However, interactions among such
smaller motifs are different from those of domains: smaller
interaction surfaces lead to smaller binding energies and weaker
affinities, yet these interactions are important in protein-protein
interactions in response to cellular environments [41].

An early study attempted to predict motifs from multiple
sequence alignments of HIV proteins, incorporating this information
to generate a prediction model to estimate HIV-human protein-protein
interactions [42]. Another older report suggested using motif
information derived from sequences from the eMotif database, along
with other sequence information, to predict protein interactions
reliably using a kernel method [43]. Conserved signatures of varying
lengths derived from PROSITE, utilized in a bag-of-features approach
that does not retain their order in the sequence, have recently been
used to train a confidence-rated boosting algorithm to predict
drug-protein interactions [44, 45]. A method that utilizes
motif-domain interactions to predict virus-host PPIs was also
presented and obtained better results than previous reports [46].
Motif surface accessibility was included to filter the predicted
virus-host PPIs to address the issue of false positives [47].

2.1.5 Other Structural Features

The three-dimensional arrangement of atoms of the individual 
residues makes up the protein structure. Deciphering the molecular 
functions of a protein often involves examining its structures. In a 
set of proteins and their interacting partners, similar structures tend 
to have similar interactions. Multiple studies have utilized the 
structural features to derive possible interacting partners [48- 
51]. Structural features could include coordinates, electrostatic 
properties, and surface area. Publicly accessible databases like Uni- 
Prot and PDB tend to be essential resources for obtaining sequence 
and structure information. Recently, representing protein structures
as attributed graphs with residues as nodes and bonds as edges has
also been shown to be a viable option for training predictive
models [52, 53]. However, approaches based on the 3D structures
of biomolecules are limited by the paucity of high-resolution
structures and experimental benchmarks.


2.1.6 Network Topology-Based Features

Protein interaction networks are also predicted based on the
knowledge of existing networks. Such methods learn the topological
connections within a network and predict PPIs without being
dependent on data from biological sources. Missing interactions
are predicted based on the interactions already established. Link
prediction algorithms rely on social network analysis [7]. Other
methods to predict the structure of a network include "network
path of length 3," intrinsic geometry structure, common neighbor,
collaborative filtering-enhanced topology, and random walk-based
diffusion propagation [54-58].

2.1.7 Feature Extraction and Encoding of Other Binding Partners: DNA, RNA, and Small Molecules

Nucleic Acids: DNA and RNA

Many cellular processes are governed by protein-nucleic acid
(NA) interactions, which drive transcription, replication, reverse
transcription, posttranscriptional processing and transport of RNA,
and translation and degradation of mRNA. Dysregulation of protein-NA
interactions leads to various diseases [59-61]. DNA- and RNA-binding
proteins, hence, form a crucial but heterogeneous group of
macromolecules. Determining and modulating protein-NA interactions
depends on prior knowledge of structure, which is limited because
the experimental determination of complexes is a slow and intensive
process [62, 63].

Multiple prediction algorithms have been based on various
approaches to obtain effective features that capture as much
information as possible from biological data (Fig. 2). The more
common encoding methods are listed as follows:

One-hot encoding is the most common encoding scheme for
DNA and RNA sequences. Each of the four bases is encoded as 1 or
0 based on its presence at a particular position, resulting in an L*4
matrix, where L is the length of the sequence.
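
A minimal Python sketch of this encoding is given below. The function name `one_hot` is hypothetical, and leaving unknown symbols (such as N) as all-zero rows is an assumption of this sketch rather than a convention stated in the text.

```python
def one_hot(sequence, alphabet="ACGT"):
    """Encode a nucleotide sequence as an L x 4 matrix of 0s and 1s."""
    index = {base: i for i, base in enumerate(alphabet)}
    matrix = []
    for base in sequence.upper():
        row = [0] * len(alphabet)
        if base in index:  # unknown symbols stay all-zero (assumption)
            row[index[base]] = 1
        matrix.append(row)
    return matrix


# DNA uses the default alphabet; RNA is encoded by passing "ACGU".
for row in one_hot("AUGC", alphabet="ACGU"):
    print(row)
```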

Since one-hot encoding results in a low-dimensional feature 
vector, it may be insufficient to retain sequence context informa- 
tion. Hence, extended one-hot encoding methods have been pro- 
posed [64]. A stacked codon-based encoding scheme maps three 
consecutive nucleotides to a pseudo-amino acid. The codon to 
residue map used is standard; however, it is conducted in an over- 
lapping manner due to the uncertainty of the starting site. This 
representation is then converted to a one-hot matrix of L*21 
(20 residues+1 stop codon). 
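
The stacked codon scheme can be sketched in Python as follows. This is an illustration under stated assumptions: the standard genetic code is used, the function name `stacked_codon_encoding` is hypothetical, and no padding is applied, so a sequence of length L yields L - 2 overlapping windows in this sketch.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons ordered TTT, TTC, TTA, TTG, TCT, ...
# The '*' symbol marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
SYMBOLS = "ACDEFGHIKLMNPQRSTVWY*"  # 20 residues + 1 stop symbol


def stacked_codon_encoding(sequence):
    """Translate every overlapping triplet and one-hot encode the result."""
    seq = sequence.upper().replace("U", "T")  # treat RNA like DNA
    # A KeyError is raised for triplets containing ambiguous bases.
    residues = [CODON_TABLE[seq[i:i + 3]] for i in range(len(seq) - 2)]
    matrix = []
    for residue in residues:
        row = [0] * len(SYMBOLS)
        row[SYMBOLS.index(residue)] = 1
        matrix.append(row)
    return residues, matrix


residues, matrix = stacked_codon_encoding("AUGGCC")
print(residues, len(matrix), len(matrix[0]))
```

Sliding the window by one base rather than three covers all three reading frames, which is why the scheme does not need to know the true start site.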

K-mer encoding is another method to convert biological
sequence data to a machine-readable format. RNA or DNA
sequences can be transformed using the k-mers sparse matrix
method [65]. Each sequence, composed of four symbols (A, C, T, and G
in DNA; A, C, U, and G in RNA), is read with a window that advances
one nucleotide at a time. A unit is "k" nucleotides long, and for a
sequence of length "L," the total number of k-mers is L - k + 1.
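
The sliding-window extraction and the L - k + 1 count can be checked with a short Python sketch. The function names `kmers` and `kmer_count_vector` are hypothetical, and the fixed vocabulary of all 4^k possible k-mers is one simple way to obtain a count vector; it is not necessarily the sparse matrix layout of ref. [65].

```python
from collections import Counter
from itertools import product


def kmers(sequence, k):
    """Return all overlapping k-mers; a length-L sequence gives L - k + 1."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_count_vector(sequence, k, alphabet="ACGT"):
    """Count each of the len(alphabet)**k possible k-mers in the sequence."""
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(kmers(sequence, k))
    return [counts[word] for word in vocabulary]


words = kmers("AUGGCUA", 3)
print(words, len(words))          # 7 - 3 + 1 = 5 k-mers
print(kmer_count_vector("ACGTAC", 2))
```

The count vector grows as 4^k, so larger values of k quickly produce long and mostly empty vectors.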

While k-mer encoding is a discrete representation of features,
the correlation among k-mers cannot be retained. Hence, the need
for a continuous distributed representation arises. DNA and RNA
are treated as a language, k-mers as words, and sequences as
sentences. Inspired by the recent developments in natural language
processing, word embedding methods are adapted to suit
biological data. For example, word2vec and GloVe embeddings
allow learning of continuous-valued vectors for k-mers [66, 67].

Fig. 2 Encoding schemes for RNA or DNA sequences. A representative short RNA sequence is depicted as an
example. (a) One-hot encoding, where four channels are present, one for each of the four nucleotides. The
presence of each nucleotide is noted across the sequence. (b) k-mer encoding; 3-mer encoding is shown as an
example. (c) Continuous distributed representations for various sequences can be derived using
algorithms such as Word2vec. (d) Stacked codon encoding slides a three-base window over the sequence and
maps each triplet to an amino acid. The amino acid sequence is then one-hot encoded

Small Molecules

The binding of small molecules, or drugs, to a biological target
induces a change in behavior or function, which leads to changes in
physiology. The target could be a protein or a nucleic acid.
Establishing drug-target interactions through experimental studies
of pharmacology or reverse pharmacology is a time-consuming as well
as costly process [68, 69]. Hence, there is a need for reliable
computational methods to predict drug-target interactions, reducing
the research space to be covered in the laboratory [70]. Over the past
few decades, the number of compounds being synthesized has
increased rapidly, yet their possible target profiles and effects
are largely unidentified. Additionally, there are still a large
number of diseases that warrant potential cures or at least drugs to
manage the symptoms. Since information is already available on
multiple drugs and their biological targets, there is a need to
utilize this high-dimensional data to build predictive algorithms
for determining novel drug-target interactions that could be
potentially beneficial. Prediction of such interactions will not
only help discover new drugs but also allow drug repurposing as well
as determination of potential side effects [71-73].

Fig. 3 Encoding schemes for small molecules. An example of the drug 3,4-methylenedioxymethamphetamine
is depicted. (a) String-based methods translate the chemical structure to a string that attempts to retain
structural information. For example, simplified molecular-input line-entry system (SMILES) focuses on
localized substructures, while self-referencing embedded strings (SELFIES) guarantee valid molecular
structures. (b) Chemical fingerprints encode the structure into a binary vector based on the substructures present.
(c) Graph-based embeddings transform the molecule into a series of nodes and edges described by adjacency
and node features

Computational methods of small-molecule interaction prediction
follow three approaches. The ligand-based approach assumes that
small molecules interacting with a protein will be structurally
similar to its natural ligands. The docking-based approach uses the
3D structures of the proteins and the drugs to predict binding. The
third, the chemogenomic approach, extracts and uses the information
of the drug and the target simultaneously and includes both
feature-based methods and similarity-based methods [74]. A major
advantage of this approach is that it allows utilization of the
extensive biological data available across various online platforms
and public databases. Feature-based methods revolve around
discovering and implementing discriminative factors of the drug,
target, and interaction interface. Hence, accurate representation of
the features that serve as input to the prediction models is
essential. Common data formats and encoding schemes are discussed in
this section (Fig. 3):

(i) String

Small-molecule structures may be converted to human-readable as
well as machine-readable forms via various methods, the most widely
adopted being conversion to strings. Among the most widely used
formats to represent chemical compounds as strings is the simplified
molecular-input line-entry system (SMILES) [75]. The compound is
described starting from one atom and visiting all the others, with
ring bonds broken so that cycles can be written linearly. The
line entry is extended by specific rules for each atom, bond, cycle,
branch, and stereochemical property. Other string-based
representations have also been developed to better describe
substructures or constraints, such as SMARTS and SELFIES [76, 77].

Encoding the SMILES or other representations into words or 
numbers allows training prediction algorithms on the drug features 
derived from the structure. An example of encoding as a mix of 
one-hot and multi-hot vectors is normalizing the number of 
valence electrons and encoding chirality and aromaticity for each 
atom [78]. In many studies, the SMILES representation is directly 
input as a vector to allow the neural networks to extract relevant 
features [79]. Mapping characters to real number vectors allows 
word-like embedding of SMILES, which may be achieved by 
Word2vec. Along with sequential networks like RNN or LSTM, 
these “word” embeddings also serve as powerful representations 
[80, 81]. 
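
A simple way to turn SMILES strings into numerical input is token-level one-hot encoding, sketched below in Python. The tokenizer is deliberately simplified and is an assumption of this sketch: it recognizes bracket atoms and the two-letter halogens Cl and Br and treats every other character as its own token. The function names and example molecules (acetyl chloride and ethanol) are chosen only for illustration.

```python
import re

# Simplified tokenizer (assumption): bracket atoms, Br, Cl, then single chars.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens."""
    return TOKEN_PATTERN.findall(smiles)


def build_vocab(smiles_list):
    """Map every token seen in the dataset to an integer index."""
    tokens = sorted({t for s in smiles_list for t in tokenize_smiles(s)})
    return {token: i for i, token in enumerate(tokens)}


def encode(smiles, vocab, max_len):
    """One-hot encode a SMILES string; trailing all-zero rows are padding."""
    ids = [vocab[token] for token in tokenize_smiles(smiles)]
    if len(ids) > max_len:
        raise ValueError("SMILES string is longer than max_len tokens")
    matrix = [[0] * len(vocab) for _ in range(max_len)]
    for position, idx in enumerate(ids):
        matrix[position][idx] = 1
    return matrix


dataset = ["CC(=O)Cl", "CCO"]
vocab = build_vocab(dataset)
print(tokenize_smiles("CC(=O)Cl"))
print(encode("CCO", vocab, max_len=4))
```

The integer indices produced by `build_vocab` could equally be fed to an embedding layer instead of being expanded to one-hot rows.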


(ii) Fingerprint 


Constitutive scaffolds and certain functional groups occur
commonly in chemical compounds. These can be used to define
chemical fingerprints that describe small molecules as a simpler
representation of their complex structures [82]. The several ways to
extract drug fingerprints can be categorized into either
topology-based or SMARTS-based schemes. The topology-based
fingerprinting schemes include information on the bonds and atoms
after calculating distances in the molecules, e.g., Morgan, ECFP,
and 2D pharmacophore. SMARTS-based fingerprinting considers
the bond orders and aromaticity based on the SMARTS
profile; PubChem and MACCS are examples of such fingerprinting
algorithms [83, 84].
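
Once molecules are represented as binary fingerprints, they can be compared bit by bit. The Python sketch below uses the Tanimoto coefficient, a commonly used similarity measure that is not named in the text and is introduced here as an assumption. The two 8-bit fingerprints are invented for illustration; in practice the bit vectors would come from a cheminformatics toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity: shared on-bits divided by bits set in either."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Two empty fingerprints are given similarity 0.0 (assumption).
    return both / either if either else 0.0


# Hypothetical 8-bit fingerprints; each bit marks one substructure key.
drug_a = [1, 1, 0, 0, 1, 0, 1, 0]
drug_b = [1, 0, 0, 0, 1, 1, 1, 0]
print(tanimoto(drug_a, drug_b))
```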


(iii) Graph-Based 


Weave or graph-neural fingerprints of drugs have been shown 
to be successful in capturing the chemical properties of the com- 
pounds in multiple recent studies. The molecules are converted to 
graph adjacency matrices with atom and bond information. These 
matrices, as inputs to graph convolution networks (GCN), are 
useful in generating the context of nodes. GCNs are of two sub- 
types — spectral and spatial GCNs. Spectral GCNs consider the 
entire graph, while spatial GCNs only consider local subgraphs 
[85, 86]. 
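
The conversion of a molecule to an adjacency matrix, followed by one round of neighborhood aggregation, can be sketched in plain Python. This is a simplified illustration in the spirit of a spatial graph convolution: it averages features over each atom and its bonded neighbors and omits the learned weights and nonlinearity of an actual GCN layer. The function names and the two-element feature vector are assumptions of this sketch.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from a list of bonded pairs."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i][j] = adj[j][i] = 1
    return adj


def aggregate(adj, features):
    """Average the feature vectors of each atom and its bonded neighbors."""
    n_atoms, n_features = len(adj), len(features[0])
    updated = []
    for i in range(n_atoms):
        neighborhood = [i] + [j for j in range(n_atoms) if adj[i][j]]
        updated.append([
            sum(features[j][d] for j in neighborhood) / len(neighborhood)
            for d in range(n_features)
        ])
    return updated


# Heavy atoms of ethanol (SMILES: CCO) as a toy graph: C0 - C1 - O2.
adj = adjacency_matrix(3, [(0, 1), (1, 2)])
node_features = [[1, 0], [1, 0], [0, 1]]  # [is_carbon, is_oxygen]
print(aggregate(adj, node_features))
```

After one round, the feature vector of each atom already reflects its immediate chemical context; stacking several rounds widens that context to atoms farther away.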


Additionally, the fusion of multiple features derived from proteins
has also been an area of active research. Multiple methods of
feature extraction have been developed to improve the information
derived. Treating a protein sequence as a set of signals allowed
implementing an autocovariance-based encoding scheme, which
considers both the positional amino acid composition and the
physicochemical properties of the protein [87]. Initially introduced
in the field of theoretical physics, the resonant recognition model
(RRM) assigns amino acids a set of physicochemical properties and
then encodes them as numerical sequences. The degree of correlation
between the parameters and protein activity or energy of binding
allows using RRM for protein analysis [88]. Various other encoding
and representation schemes have been proposed and implemented,
some of which have been reviewed in a recent publication [89].

The identification of protein interactions is essential to the
study of biological networks, yet their prediction and identification
are error-prone. The limitations posed by features based only on
sequence and structure could be overcome by incorporating
high-throughput biological data including, but not limited to,
microarrays, next-generation RNA sequencing reads, and expressed
sequence tags [90-92]. Inclusion of gene expression profile features
would allow identification of gene products that change expression
together with some other factor, regulated by mechanisms that can be
unraveled by such studies. However, the data used as features should
be obtained using standard experimental techniques and standard
pipelines for data analysis to ensure the robustness of the
algorithm. Since interrelationships between protein interaction and
gene expression profiles may be hard to elucidate, statistical
measures of correlation are computed and significance thresholds are
set before integration into training data for prediction algorithms.

2.2 Applications

Understanding the relationship among various biological entities is 
equally vital as the mere knowledge of their existence in formalizing 
many biological processes. For instance, cell differentiation is 
dependent on both the types of proteins present in the system 
and their associations. High-throughput technology has made 
biological network studies possible and allowed for progress in 
open problems related to drug target discovery and pathway analy- 
sis. Further, in the era of “big data,” extracting knowledge from the 
enormous amount of data has become a vital part of most domains, 
including biology and bioinformatics [93]. Machine learning 
(ML) has proven to be an efficient tool to discover underlying 
patterns in biological networks, build models, and make a predic- 
tion based on the most robust model. ML algorithms, including 
Bayesian networks, random forests, support vector machines, and 
hidden Markov models, have been used extensively in genomics,
proteomics, and systems biology [94].

Recently, deep learning (DL)-based solutions have seen
unprecedented applications in diverse fields like machine vision,
signal processing, natural language processing, and computational
biology due to their ability to model complex data. For instance,
IBM's Watson and Google's AlphaFold have achieved great success
toward solving critical problems in clinical oncology and protein
folding, respectively [95, 96]. The main advantage of DL-based
methods is that a problem is solved by passing input signals through
a network that learns to recognize intricate patterns.
Artificial neural networks (ANNs), which are the fundamental
building blocks of most deep learning architectures, resemble the
working of neurons in the human brain. DL can combine
simpler features and learn complex substructures in data. In other
words, with the presence of nonlinearity in stacked layers of a DL
architecture, data can be hierarchically represented with an
increasing level of abstraction.

2.2.1 PPI Networks to Understand Disease

Most of our current knowledge of the etiology of various diseases
comes from approaches aiming to uncover their genetic basis. The
ability to generate individual genome data with next-generation
sequencing methods promises to change the field of translational
bioinformatics even more. To understand the biological intricacies
of pathogenesis and disease progression, it is therefore necessary
to identify the molecules and mechanisms that trigger, participate
in, and control perturbed biological processes. Deciphering such
molecular mechanisms leading to diseased states is an
even bigger challenge than elucidating the genetic basis of complex
diseases. Even when the genetic basis of a disease is well
understood, not much is known about the molecular details leading to
the disorder.

2.2.2 Protein Function Prediction

Functional annotation of proteins plays a crucial role in identifying 
disease-causing aberrations in genes or proteins, understanding 
cellular mechanisms, and developing tools for prevention, diagno- 
sis, and treatment of disease. Complex relationships among geno- 
type and phenotype have guided the analysis of genome-wide 
molecular interaction data. Multiple databases have curated and 
integrated such heterogeneous data at varied extents of biological 
complexity [97]. As opposed to manual curation, other methods 
try to extend the primary data with predictions and indirect associ- 
ation to estimate a bigger picture of the biological process [98- 
100]. Similarly, functional annotation of a newly sequenced protein 
is performed using homology mapping or by identifying functional 
domains from preexisting databases. BLAST, FAST, Pfam, Pro- 
Dom, and SCOP are some of the commonly used homology- 
based methods [24, 101-103]. Such models are often guided by 
the “guilt by association” principle that works under the assump- 
tion that adjacent nodes in a network have more functional similar- 
ity in comparison to farther nodes. 
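
The "guilt by association" principle described above can be illustrated with a minimal sketch, assuming a toy network stored as a dictionary of neighbor sets and a simple majority vote among annotated direct neighbors. The protein names, the annotations, and the helper `predict_function` are hypothetical and are not taken from any of the cited tools.

```python
from collections import Counter


def predict_function(network, annotations, protein):
    """Return the most common function among annotated direct neighbors."""
    votes = Counter(
        annotations[n] for n in network.get(protein, ()) if n in annotations
    )
    if not votes:
        return None  # no annotated neighbor, so no prediction is made
    return votes.most_common(1)[0][0]


# Hypothetical toy network: protein -> set of direct interaction partners.
network = {
    "P1": {"P2", "P3", "P4"},
    "P2": {"P1"},
    "P3": {"P1"},
    "P4": {"P1"},
}
# Hypothetical known annotations; P1 is the unannotated query protein.
annotations = {"P2": "kinase", "P3": "kinase", "P4": "transporter"}

predicted = predict_function(network, annotations, "P1")
print(predicted)
```

Real methods typically weight neighbors by distance or edge confidence rather than counting direct neighbors equally.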


Table 7
Application of deep neural networks, recurrent neural networks, and
emergent architectures in tasks involving biological datasets

                           Omics                                   Signal processing
Architecture               Research topic              Reference   Research topic          Reference
Deep neural networks       Protein structure           [222-224]   Brain decoding          [225]
                           Gene expression regulation  [226-228]   Anomaly classification  [229]
                           Protein classification      [230]
                                                                   Anomaly classification  [231]
Recurrent neural networks  Protein structure           [232]       Brain decoding          [233]
                           Gene expression regulation  [234]       Anomaly classification  [235]
                           Protein classification      [236]
Emergent architectures     Protein structure           [237]       Brain decoding          [235]


2.2.3 Protein-Drug 
Interaction Site Prediction 
Using PPIs


2.3 Template-Based 
Methods of Protein 
Interaction Prediction 


PPI-based computational methods to determine protein func- 
tion are limited due to the lack of uniformity in the network 
topology. While contrasting functions arise from different gene 
sets, the prediction accuracy is largely affected by the number of 
neighbors or choice of distance metric. DNNs have been exten- 
sively used for the prediction of protein function. 


Targeted therapies greatly benefit from the functional discovery of 
candidate disease-related genes. Computational methods based on 
PPI profiles are extensively summarized in the literature
[104, 105]. One such implementation involves network construc- 
tion aided by gene expression profiles to identify critical nodes in a 
biological network. The primary goal of any method involving PPIs 
is to identify critical nodes and their neighbors in the network as 
therapeutic targets, based on the rationale that protein interactions 
play an important role in systematic aberrations. This led to the
notion of “guilt by association,” which assumes that entities related
to a known disease-causing agent (gene/protein) are likely to be
involved in the disease. Some typical applications are listed in
Table 7.


Recent advances in sequencing have allowed the generation of a 
wealth of protein data. Integrating and extracting relevant data 
from such diverse and extensive sources demand computational 
methods. Besides machine learning-based methods, protein inter- 
actions can be predicted by various methods, e.g., interolog identi- 
fication, gene coexpression, and gene cluster analysis. Some of the 
methods are described below.

As protein complexes are increasingly purified and their structural
data made available, template-based predictions of protein
complexes are attracting attention. The interactions between
proteins and networks are modeled based on the similarity of sequence
or structure of other protein complexes, also known as the
templates. The method involves identifying a nonredundant dataset of
templates for the protein complexes to predict and then evaluating
the predictions based on some scoring function. The scoring
function is generally a statistical potential or the energy of the complex.
Template-based methods tend to be more efficient than docking,
especially at a proteome scale, helping limit the number of possible
favorable conformations [106, 107].

Assembling an inclusive but nonredundant dataset of templates
is crucial to template-based methods: templates, if not correctly
selected, could lead to false positives or false negatives. A limitation
of this approach is the unavailability of similar templates for some
proteins. Complex structures and novel interactions cannot be
predicted in the absence of templates. However, when suitable
templates are available, the prediction algorithms are reliable and
fast. The increasing number of protein and protein complex
structures being deposited in the PDB suggests that template-based
algorithms will continue gaining attention. Broadly, template-based
methods can be divided into two major categories: homology-based and
interface-based algorithms.


2.3.1 Homology-Based
Approaches


Template-based prediction methods are based on the finding that
proteins with a sequence identity of at least 30-40% associate similarly
[108]. However, exceptions to this finding also exist [109].
Predicting interactors based on sequence homologs, scored using
empirical potentials derived from experimentally established
interactions, has been used widely. Knowledge-based potentials are
easy to implement and have been shown to be successful in
predicting interologs, i.e., interaction homologs [110, 111].
InterPreTS, a web server built on a similar approach that uses Blast2
as its homolog search algorithm, predicts protein interactions
with this knowledge-based scoring method [112]. GWIDD, a database
of experimental and template-based interaction models for various
species, consists of structural representations of multiple
genomes [113].
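
The idea of transferring interactions between homologs (interologs) can be sketched as follows. This is a minimal illustration, assuming that a table of homologs between a source and a target species has already been computed (for example, by a sequence search); the scoring step with knowledge-based potentials is omitted. All identifiers and the function `map_interologs` are hypothetical.

```python
def map_interologs(known_interactions, homologs):
    """Transfer source-species interactions to homologous target proteins."""
    predicted = set()
    for a, b in known_interactions:
        for a_target in homologs.get(a, ()):
            for b_target in homologs.get(b, ()):
                # frozenset makes the predicted pair order-independent
                predicted.add(frozenset((a_target, b_target)))
    return predicted


# Hypothetical experimentally established interactions in the source species.
known_interactions = [("yA", "yB"), ("yB", "yC")]
# Hypothetical homolog table: source protein -> target-species homologs.
homologs = {"yA": ["hA"], "yB": ["hB1", "hB2"]}

interologs = map_interologs(known_interactions, homologs)
print(sorted(tuple(sorted(pair)) for pair in interologs))
```

The interaction involving `yC` is not transferred because `yC` has no homolog in the table, which mirrors the limitation that predictions are only possible where suitable homologs exist.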

An advantage of homology-based methods is that the bound
state of even unstructured proteins can be predicted by comparing
with similar proteins. Another approach, the WSsas method,
which maps queried protein sequences to known structures on
the basis of the functional residues of homologous proteins, has
also been shown to be promising [114]. Distinct members of the same
family of domains are also known to associate in a similar manner.
Hence, domain information is incorporated into the prediction of
protein complexes [115]. Matched domains scored
by statistical potentials derived from side-chain contacts have been shown
to distinguish non-native contacts accurately [116]. Scoring and
discrimination based on the dynamics of interface residues have also
been applied [117]. Additional methods reported to predict
structural interaction based on homology include machine learning
techniques that combine geometric, physicochemical, and similarity
information, covered in detail in the sections that follow.


2.3.2 Interface-Based
Methods


2.3.3 Gene-Based
Methods


Although the structure of a protein tends to be more conserved 
than its sequence, interface residues are evolutionarily more con- 
served than the structure [118]. Hence, it is also possible that 
entirely different protein pairs may share similar interaction inter- 
face frameworks [119]. Multiple such reports prompted the idea 
that implementing information derived from the interface regions 
alone, independent of the sequence and global structure, could 
predict protein complex interactions adequately. Homology-based 
methods fall short when protein sequence similarity is low. How- 
ever, interface-based methods for prediction are sequence- 
independent. 

The first algorithm to implement a surface-based prediction
was PRISM [120]. PRISM combines evolutionary information
and geometric complementarity, allowing the prediction of target
proteins by homologous spatial motif search. In a later study,
sequence homology and global fold parameters were found to be of
lesser importance than local structural alignments [121]. It was
also reported that the conservation at the interface is useful for 
predicting interactions for even evolutionarily remote proteins. 
PredUs implements this knowledge for protein binding prediction 
in a diverse structural dataset [122]. 


Proteins that are likely to interact may be identified by studying 
gene coexpression data. Clustering algorithms can group together 
genes with similar expression profiles. Members of such a cluster
may be considered functional association candidates and even
physical binding partners. Such candidates can be validated by 
checking multiple time points or states [123]. However, a draw- 
back of this method is that expression data can be high-throughput 
and noisy. Protein levels also do not perfectly correlate with gene 
expression levels, thereby yielding misleading interaction 
information. 
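
As a minimal sketch of the coexpression idea, the snippet below computes the Pearson correlation between expression profiles and reports gene pairs above a threshold as candidate functional associations. The expression values, the gene names, and the threshold of 0.9 are illustrative assumptions; profiles are assumed to be non-constant so that the correlation is defined.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpressed_pairs(expression, threshold=0.9):
    """Return (gene1, gene2, r) for pairs whose correlation meets the threshold."""
    pairs = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if r >= threshold:
            pairs.append((g1, g2, r))
    return pairs


# Hypothetical expression levels measured at four time points.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.0, 8.2],
    "geneC": [4.0, 1.0, 3.5, 0.5],
}
candidates = coexpressed_pairs(expression)
print(candidates)
```

Because expression data are noisy, such candidates are only a starting point and would normally be checked across further time points or conditions.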

A group of genes located within a set intergenic distance is called a
gene cluster. Ranging from a few to more than a hundred genes, clusters
house genes that have related functions and are potential interactors.
In bacteria, genes housed in operons are transcribed together. In 
eukaryotes, gene clusters are coregulated. It has been observed that 
genes involved in the same cellular pathway are often present in 
close proximity [124]. Conversely, if the genes are not in proximity 
in the genome, this approach cannot identify interactions. Multiple 
resources exist to implement this method [125, 126]. Gene cluster
analysis is a simple approach, but it indicates functional rather than
physical interactions and depends heavily on the number of
genomes used as reference.
2.3.4 Network Topology-
Based Approaches 


2.4 Learning-Based 
Methods to Identify 
Protein Interactions 


2.4.1 Machine Learning- 
Based Methods 


Protein interaction networks are now commonly represented
as graphs: proteins form the nodes and the associations
between proteins form the edges. Topological features extracted
from protein interaction graphs would indicate the number of 
direct or indirect neighbors, shortest paths, etc. “Hubs” in such 
graphs are a small number of proteins that have multiple interaction 
partners. These hubs serve as centers of function and integrity of 
cellular processes. A mathematical representation of such protein 
interaction networks allows the identification of functional rela- 
tions and novel interactions. If proteins have multiple common 
interactors in the network, they can be assumed to be a part of 
similar processes [127, 128]. For proteins with shared interaction 
partners, the structure, sequence, or biochemistry can be assumed 
to be similar. Protein interactions can be predicted by just the 
topological features independent of prior knowledge of sequence 
or structure [129]. Integration of protein sequence and function 
information would also allow better prediction of protein com- 
plexes [130]. Detection of conserved interactions and predicting 
novel interactions could also be achieved by alignment of various 
protein interaction networks, which has been reviewed in detail 
elsewhere [131]. 
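
One simple way to use shared interaction partners as a topological feature is the Jaccard index of two proteins' neighbor sets. The sketch below is illustrative only: the network is a toy example, and the Jaccard index is one of several possible neighborhood scores rather than the specific measure used in the cited studies.

```python
def jaccard_score(network, a, b):
    """Fraction of the combined neighbors of a and b that both share."""
    neighbors_a = network.get(a, set())
    neighbors_b = network.get(b, set())
    union = neighbors_a | neighbors_b
    return len(neighbors_a & neighbors_b) / len(union) if union else 0.0


# Hypothetical toy network: protein -> set of direct interaction partners.
network = {
    "P1": {"P3", "P4", "P5"},
    "P2": {"P3", "P4", "P6"},
    "P3": {"P1", "P2"},
    "P4": {"P1", "P2"},
    "P5": {"P1"},
    "P6": {"P2"},
}

# P1 and P2 are not connected but share two of their four combined neighbors.
score = jaccard_score(network, "P1", "P2")
print(score)
```

A high score for an unconnected pair suggests that the two proteins take part in similar processes and may be worth testing as a novel interaction.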


It has been observed that in silico predictions of PPIs show
accuracy comparable to that of large-scale experimental PPI
datasets. Furthermore, machine learning algorithms that are quick and
scalable can improve the efficiency of experimental methods when
used in tandem [132]. Machine learning techniques used for
predicting PPIs can be broadly classified into two categories:
supervised and unsupervised. The distinction depends on whether the
input variables need to be labeled according to the expected outcome. In
general, supervised learning infers a mapping function from given 
input-output pairs that can be used to train a model for predicting 
outcomes for other inputs. On the other hand, unsupervised 
learning discovers the hidden structure within unlabeled training 
data for drawing meaningful inferences. 

Artificial neural networks (ANNs), support vector machines 
(SVMs), Bayesian inference, and decision tree-based methods 
such as random forest (RF) are some of the supervised learning 
algorithms that are used for predicting PPIs [133]. Supervised
machine learning is implemented for classification problems, i.e.,
segregating input data points into specific classes, where a set of
quantitative or categorical features is analyzed to find those that are
capable of discriminating the input data points into the specified classes.
Figure 4 represents a schematic describing the various types of ML 
algorithms and their general use cases. Clustering techniques fall 
under unsupervised learning, where methods such as k-means, 
single-linkage, and spectral clustering are used for PPI prediction. 


[Figure 4 schematic. Traditional programming: data + algorithm → output.
Machine learning: data + output → algorithm. Unsupervised learning
(clustering): group/cluster data based on inputs. Supervised learning
(classification, regression): develop predictive model based on inputs
and outputs.]

Fig. 4 Machine learning and its derivatives. A schematic highlighting the fundamental difference between
traditional programming and machine learning. Broad categories of ML, based on the algorithm and problem,
are also compiled


PPI prediction, which is generally a binary classification task, 
has two categories: the “positive” (p) class, containing proteins that 
interact with each other, and the “negative” (n) class, containing 
proteins that do not interact. A given instance or data point is 
classified as “positive” if the computed score (represented as a 
random variable X) is above a given threshold and “negative” 
otherwise. A given prediction can fall under one of the four cate- 
gories for a binary classification task, namely: 


1. True positive (TP): Proteins interacting and correct inference 
by model as interacting partners. 


2. True negative (TN): Proteins not interacting and correct infer- 
ence by model as noninteracting. 


3. False positive (FP): Proteins not interacting but incorrect inference
by model as interacting.


4. False negative (FN): Proteins interacting but incorrect inference by
model as not interacting.
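
The four outcomes listed above can be counted directly from predicted scores and known labels. The following is a minimal sketch, assuming scores in the range 0 to 1 and a threshold of 0.5; the scores and labels are made-up values. Precision and recall are included as two standard summaries derived from these counts, although the text does not define them.

```python
def confusion_counts(scores, labels, threshold=0.5):
    """Count TP, TN, FP, and FN for a score threshold."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for score, interacts in zip(scores, labels):
        predicted_positive = score > threshold
        if predicted_positive and interacts:
            counts["TP"] += 1
        elif predicted_positive and not interacts:
            counts["FP"] += 1
        elif not predicted_positive and interacts:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Hypothetical model scores and true labels (True = proteins interact).
scores = [0.9, 0.8, 0.3, 0.6, 0.1]
labels = [True, False, True, True, False]

counts = confusion_counts(scores, labels)
precision = counts["TP"] / (counts["TP"] + counts["FP"])
recall = counts["TP"] / (counts["TP"] + counts["FN"])
print(counts, precision, recall)
```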


The datasets for model training, validation, and testing can be 
prepared using techniques such as randomization, cross-validation, 
and bootstrapping. Given a large enough dataset, it can be divided 
randomly into “k” equal parts. Each of these “k” parts can then be 
randomly used as training and testing sets. This randomization 
ensures the random sampling of training and testing sets that is 
vital for averting any selection bias during the training process. 
However, as in many cases, large enough datasets required for 
proper randomization are rarely available. In such cases, the same 
dataset is repeatedly split into training and testing sets in different 
ways by a technique called cross-validation or rotation estimation. 
These techniques can be exhaustive, as in leave-p-out
cross-validation (LpOCV), where p observations
are set aside as the test set and the remaining observations are taken
as the training set, or leave-one-out cross-validation (LOOCV),
where p = 1.


2.4.2 Decision Tree-
Based Method


2.4.3 Probabilistic/
Bayesian Classification


On the other hand, k-fold cross-validation is the most
common form of non-exhaustive cross-validation, where the
original sample is randomly partitioned into k equal-sized subsamples,
from which a single subsample is retained as the test set and the
remaining subsamples are used as the training set. The process is
then repeated k times, with each of the k subsamples used exactly
once as the test set. A common example is fivefold
cross-validation, where the dataset is divided into five subsets,
of which four subsets are used in training the model and the
remaining one is used for testing it, and the process is repeated
five times, using a different subset for testing in each iteration.
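
The k-fold procedure can be sketched with the standard library alone. The helper `k_fold_indices` below is a hypothetical illustration, not a library function; it shuffles sample indices with a fixed seed and yields one train/test split per fold.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Slicing with a stride of k gives k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Fivefold cross-validation on ten samples: 8 for training, 2 for testing.
splits = list(k_fold_indices(10, 5))
for train, test in splits:
    print(sorted(train), sorted(test))
```

Setting `k` equal to the number of samples gives leave-one-out cross-validation. In practice, a library routine such as scikit-learn's `KFold` would normally be used instead of a hand-written splitter.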


Decision tree algorithms involve recursive partitioning of the input 
space by selecting the best attribute and expanding the leaf nodes of 
the tree until a predefined stopping criterion is attained. For
instance, a simple criterion could be the minimum number of training
instances assigned to each leaf node of the tree. The best test
condition for splitting is determined by different algorithms using 
different metrics such as Gini impurity and information gain. Gini 
impurity is a measure of misclassification that denotes the probabil- 
ity of a randomly chosen element from the set being incorrectly 
labeled according to the distribution of labels in the subset. 
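
Both splitting metrics named above have short closed forms: Gini impurity is one minus the sum of squared class proportions, and information gain is the entropy of the parent node minus the size-weighted entropy of its children. The sketch below evaluates them on a made-up set of labels ("p" for interacting, "n" for noninteracting).

```python
import math
from collections import Counter


def gini_impurity(labels):
    """1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the label distribution, in bits."""
    n = len(labels)
    return -sum(
        (count / n) * math.log2(count / n) for count in Counter(labels).values()
    )


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node with balanced classes and a split that separates them.
parent = ["p", "p", "p", "n", "n", "n"]
left, right = ["p", "p", "p"], ["n", "n", "n"]

print(gini_impurity(parent))                    # impurity before the split
print(gini_impurity(left))                      # impurity of a pure child
print(information_gain(parent, [left, right]))  # gain of the perfect split
```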


Biologists have a strong preference for Bayesian-probabilistic clas- 
sifiers due to their diverse functionalities. Moreover, while machine 
learning solutions like SVMs and neural networks are considered 
black-box models due to their limited interpretability, Bayesian- 
probabilistic models are more natural and can handle numerical as 
well as categorical data. Algorithms that use conditional probability 
distributions as a way to model relationships among features of 
training samples and their class labels are called probabilistic
classifiers. For instance, if the features of input data are denoted by
$x_i$ ($i = 1, \ldots, M$), then the feature vector for each data point can
be represented as $x = [x_1, x_2, \ldots, x_M]$ and the probability of the
data point belonging to each of the $N$ classes ($c = c_1, c_2, \ldots, c_N$) as
$P(C = c_1 \mid x), P(C = c_2 \mid x), \ldots, P(C = c_N \mid x)$.

After modeling the class conditional probabilities, the probabi- 
listic approach seeks to classify the input data points to the class 
with the maximum probability. In case of a binary classification 
problem, this translates to computing the ratio
$Y = P(x \mid C = c_1) / P(x \mid C = c_2)$ and then choosing $c_1$ if
$Y > 1$, and $c_2$ otherwise, because the decision boundary is formed by
the region of the feature space where $Y = 1$.
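
The likelihood-ratio decision rule can be illustrated with a small naive Bayes sketch. Several assumptions are made here that the text does not state: features are binary, they are treated as independent given the class, the two classes have equal priors, and probabilities are estimated with Laplace smoothing. The training vectors are invented for illustration.

```python
def fit_feature_probs(samples, alpha=1.0):
    """Estimate P(x_i = 1 | class) per feature with Laplace smoothing."""
    n, m = len(samples), len(samples[0])
    return [
        (sum(sample[i] for sample in samples) + alpha) / (n + 2 * alpha)
        for i in range(m)
    ]


def likelihood(x, probs):
    """P(x | C), assuming features are independent given the class."""
    result = 1.0
    for x_i, p in zip(x, probs):
        result *= p if x_i else (1.0 - p)
    return result


def classify(x, probs_c1, probs_c2):
    """Choose c1 if the likelihood ratio Y exceeds 1, and c2 otherwise."""
    ratio = likelihood(x, probs_c1) / likelihood(x, probs_c2)
    return ("c1" if ratio > 1 else "c2"), ratio


# Hypothetical binary feature vectors, e.g. [coexpressed, colocalized, essential].
interacting = [[1, 1, 0], [1, 1, 1], [1, 0, 1]]      # class c1
non_interacting = [[0, 0, 1], [0, 1, 0], [0, 0, 0]]  # class c2

probs_c1 = fit_feature_probs(interacting)
probs_c2 = fit_feature_probs(non_interacting)
label, ratio = classify([1, 1, 0], probs_c1, probs_c2)
print(label, ratio)
```

Taking the logarithm of the ratio turns the product over features into a sum of per-feature log-odds, which is convenient when many features are combined.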

Probabilistic methods based on log-odds scoring schemes have 
been widely used in PPI prediction, as well as for filtering high- 
throughput experimental datasets that can potentially include sev- 
eral FPs. Genomic features such as coexpression values, essentiality 
and co-localization, structural features, and sequence signatures 
have been used under Bayesian-probabilistic frameworks for PPI 
prediction [134]. 


2.4.4 Artificial Neural 
Networks 


2.4.5 Clustering 


2.5 Challenges and 
Limitations 


2.5.1 Reliability of 
Protein Interaction Data 
An artificial neural network (ANN) is one of the oldest machine
learning algorithms used to perform nonlinear statistical modeling
and develop binary classification models, and ANNs have since evolved into
state-of-the-art deep learning architectures such as recurrent neural
networks and stacked autoencoders, among others. Due to their
huge model capacity, ANNs generally require less formal statistical
training to develop and are able to implicitly detect complex nonlinear
relationships between dependent and independent variables and to detect
all possible interactions between predictor variables.

ANNs are made up of a network of connections, where each of
the individual elements transfers information to and from
upstream/downstream neurons. Each connection is assigned a trainable
parameter called a weight ($w_{ij}$). The propagation function
$p_j(t) = \sum_i o_i(t)\, w_{ij}$ computes the input $p_j(t)$ to the
neuron $j$ from the outputs $o_i(t)$ of predecessor neurons.
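
The propagation function is a weighted sum of the outputs of the predecessor neurons. The sketch below computes it for a single neuron and then applies a sigmoid activation; the choice of activation and the numerical values are assumptions made for illustration, since the text only specifies the propagation step.

```python
import math


def propagate(outputs, weights):
    """Weighted sum p_j = sum_i o_i * w_ij of predecessor outputs."""
    return sum(o * w for o, w in zip(outputs, weights))


def sigmoid(z):
    """Logistic activation, used here as an assumed nonlinearity."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical outputs o_i of three predecessor neurons and weights w_ij.
o = [0.5, 1.0, -1.0]
w_j = [0.4, 0.2, 0.4]

p_j = propagate(o, w_j)  # net input to neuron j
o_j = sigmoid(p_j)       # output passed on to downstream neurons
print(p_j, o_j)
```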


Clustering is the primary form of unsupervised machine learning
used for classification problems; it tries to segregate data
points into groups such that data points placed in the same group
are more similar to each other than to those in other groups. Clustering is
useful in exploratory pattern analysis, pattern classification, and 
decision making and for outlier detection. Also, clustering is pri- 
marily used in the cases when the class labels are not known in 
advance. The main advantage of clustering is its ability to determine 
the intrinsic classification within a set of unlabeled data, hence not 
requiring a separate training stage. 

The different distance metrics used by clustering algorithms 
include (a) Euclidean distance metric, (b) Euclidean squared dis- 
tance metric, (c) Manhattan (city block) distance, (d) Chebyshev 
distance, (e) Pearson’s correlation coefficient, (f) squared Pearson’s 
correlation coefficient, and (g) Spearman’s rank correlation
coefficient. A general schematic showing the associated elements of
a simple neural network is given in Fig. 5.


The rapid development of tools for experimentally identifying pro- 
tein interactions has been accompanied by computational methods 
for the analysis of experimental data as well as the prediction of 
novel interactions. Despite the remarkable progress in the develop- 
ment of technical and analytical tools for the identification and 
prediction of protein interaction networks, obstacles remain — 
some inherent to the field and some unique to each approach. 
Some difficulties encountered while implementing DL methods
are also listed.


Many experimental studies have reported multiple protein interaction
networks, and with high-throughput studies comes the inevitable
problem of noise. Limited by various factors, studies also yield
numerous false negatives, such as transient or cell stage-specific protein
interactions that may not be detected. Hence, filtering out noisy
data before analysis and integration is essential. Moreover, reducing
the false negatives arising out of noisy and incomplete interaction
data is another challenge that needs to be tackled.


Fig. 5 Overview of a simple neural network architecture. Data, in the form of a
feature matrix, is fed to the network, followed by nonlinear transformations in the
hidden layers. The parameters of these hidden layers are heuristically optimized
to produce the required output


2.5.2 Data Integration


2.5.3 Dynamic Protein
Network Construction


2.5.4 Evaluation of
Protein Interaction
Networks


The reliable analysis of the protein interaction network centrality, 
modularity, and dynamics is hindered by input data noise. A more 
comprehensive analysis could be obtained by integrating data from 
multiple biological sources, such as RNA-Seq data, protein domain 
information, cellular localization, etc. However, effective integra- 
tion of data from multiple biological sources is a research area with 
many gaps. 


The inherent transient properties of protein expression and inter- 
action have shifted the focus from understanding static to dynamic 
networks. Recently, the integration of time series data of expression 
and static interactions has been proposed. Yet, the accuracy of 
expression and the number of time points needed are limiting 
factors for the applicability of such methods. High-sensitivity and 
high-throughput technologies are slowly replacing traditional 
time-intensive methods, yet spatial and temporal analysis remains 
hindered by the tissue and cell location-specific processes. Hence, 
combining multiple data sources across various cell-specific and 
temporal analyses will emerge as a hotspot in the field of computa- 
tional biology. 


The computational analysis and prediction of protein interaction 
networks depend on the reliability of the experimental data. Hence, 
there is a need to evaluate the quality of the interaction networks in 
a manner that is not sensitive to the experimental conditions. 


2.5.5 Lack of Data 


2.5.6  Overfitting 


2.5.7 Data Imbalance 
Deep learning methods are data-hungry. A lot of data is required to
develop a robust and accurate deep learning model [135]. In
certain biological cases, available data may not be enough. Data
could be collected from similar tasks and transfer learning applied
to design a better mapping function [136]. However, it is unknown
if this approach would allow a sufficient representation of the
original data. Modifying well-trained models from similar tasks to
fit the available data is a viable alternative [137]. Data augmentation
is another option: especially in the case of image data, rotation and
mirroring do not generally change the labels of the data. Nevertheless,
in the case of biological sequence or structure, such approaches
should be implemented with care. Simulated data may also be used to
add to the available data, provided the physical processes are well
understood and the simulators yield reliable samples [138].


DL models are generally high complexity models that deal with a 
large number of parameters, which is the reason such algorithms are 
at risk of overfitting, i.e., performing well on the training data but
failing to generalize to the test data [139]. Although some recent
studies suggest that the implicit bias of the training process deals 
with the issue of overfitting, certain cases demand specialized tech- 
niques to make models robust [135, 140, 141]. Various algorithms 
have been developed in the past few years to induce generalizability 
and can mainly be classified into three categories. The first type, 
based on model parameters and architecture, includes dropout, 
batch normalization, and weight decay [139, 142, 143]. The sec- 
ond type acts on the inputs — data augmentation and data corrup- 
tion techniques [144]. A third type acts on the model output by 
penalizing overconfident predictions [145]. A detailed account of
the types of regularization and their benefits can be found in a
recent review [146].


Biological data is generally imbalanced, with positive samples
outnumbered manyfold by negative samples [147]. Training a
model on imbalanced data introduces a bias toward the majority 
class, but reducing the amount of data used to train the model also 
reduces the information that can be extracted from the whole 
dataset. Proper performance measurement criteria should be used 
to evaluate the predictions to overcome the challenge of imbalance, 
measuring performance on both classes [148, 149]. Another 
method involves modifying the loss function to penalize the 
model if it underperforms on the minority class. The input data 
could also be under- or up-sampled when training the predictor. 
Where the data can be arranged hierarchically, different models can 
be built for each level [150]. 
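
One way to modify the loss function for imbalanced data is to weight each class inversely to its frequency. The sketch below uses the weighting n / (number of classes × class count) and a weighted binary log loss; both the weighting scheme and the toy labels are illustrative assumptions rather than the method of the cited studies.

```python
import math
from collections import Counter


def class_weights(labels):
    """Weight each class as n / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def weighted_log_loss(y_true, y_prob, weights):
    """Binary cross-entropy in which each sample is scaled by its class weight."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        total += -weights[y] * (math.log(p) if y == 1 else math.log(1 - p))
    return total / len(y_true)


# Hypothetical dataset with one positive sample for every nine negatives.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
weights = class_weights(labels)
print(weights)

# An equally confident mistake costs more on the minority (positive) class.
print(weighted_log_loss([1], [0.1], weights))
print(weighted_log_loss([0], [0.9], weights))
```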
2.5.8 Lack of
Interpretability 


2.5.9 Uncertainty Scaling 


2.5.10 Catastrophic 
Forgetting 


2.6 Standard 
Modeling Protocol 


2.6.1 Computing 
Resources 


Deep learning methods have been criticized for being black-box
models. Recent times have seen an increasing effort toward improv- 
ing the interpretability of these models. Especially in biology, it is 
important that the prediction model used is interpretable to allow 
an understanding of the features — motifs, sequences, or structures — 
which may be important for the process being studied. Many algo- 
rithms that derive feature importance from deep learning models 
assign example-specific importance scores. Reliable scores may be 
achieved by employing perturbation-based or backpropagation- 
based approaches. Perturbation-based methods alter parts of the 
input and measure the effect on the model output [151- 
154]. Backpropagation-based methods allow a signal from the 
output layer to be sent backward to the input layer to determine 
the importance of the input [155, 156]. Although such methods 
have been shown to be useful in multiple cases, they are still under 
active development. 
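
A perturbation-based importance score can be sketched by replacing one input feature at a time with a baseline value and recording how much the model output changes. The linear `toy_model` below stands in for a trained network and is purely hypothetical, as is the choice of zero as the baseline.

```python
def perturbation_importance(model, x, baseline=0.0):
    """Absolute change in model output when each feature is set to a baseline."""
    reference = model(x)
    scores = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] = baseline  # alter one part of the input at a time
        scores.append(abs(reference - model(perturbed)))
    return scores


def toy_model(x):
    """Hypothetical stand-in for a trained predictor."""
    return 2.0 * x[0] + 0.0 * x[1] - 1.0 * x[2]


importance = perturbation_importance(toy_model, [1.0, 1.0, 1.0])
print(importance)
```

The scores are specific to the example that was perturbed, which matches the example-specific nature of such importance scores noted above.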


Machine learning models not only perform prediction but also give 
a confidence score for each query of the model [157]. The confi- 
dence score informs the users of the reliability of the predictions. In 
biological problems, confidence scores prevent building on mis- 
leading and unreliable model outcomes. In such cases, scaling the 
scores to evaluate the actual risk in the given context is important. 
Probability scores from the softmax function are usually
overconfident and hence not on the right scale [145]. Obtaining
reliable outputs would involve post-scaling of the softmax outputs.
Examples of methods to perform scaling include Platt scaling, 
Bayesian binning, and histogram binning [157-159]. A recently 
proposed temperature scaling showed much better results than 
other methods [160]. 
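
Temperature scaling divides the logits by a single scalar T before the softmax is applied. The sketch below shows the effect of a temperature above 1 on a made-up logit vector. The value T = 2.0 is an arbitrary placeholder; in practice T is fitted on a held-out validation set, a step that is not shown here.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (T = 1 is the plain softmax)."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits from a three-class classifier.
logits = [4.0, 1.0, 0.0]

raw = softmax(logits)
calibrated = softmax(logits, temperature=2.0)
print(raw)
print(calibrated)
```

A temperature above 1 softens the distribution and lowers the top confidence without changing which class is predicted, so accuracy is unaffected.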


Catastrophic forgetting is when a deep learning model is unable to 
learn and remember different tasks that may be not explicitly 
labeled, may switch unpredictably, or may occur sequentially 
[161]. This is common in biology where the data is continually 
accumulating and changing. Training new models from scratch 
after incorporating the new data seems like a plausible solution, 
though it is computationally intensive and time-consuming. At
present, three types of methods are employed to deal with
catastrophic forgetting: those based on regularization, those using
dynamic neural network architectures and rehearsal training, and
those based on dual-memory learning systems [161-164].


Most ML workflows can be implemented on a standard Unix
workstation in a standard configuration. The workstation can also be
equipped with a graphics processing unit (GPU) to train deep learning
models. The exact specifications of the machine would vary depending
on the size of the dataset and model architecture. In addition to a
CUDA-capable GPU and its suitable drivers, CUDA
(https://developer.nvidia.com/cuda-toolkit), the underlying parallel
computing platform, must be separately installed for training
deep learning models. Additionally, as current deep learning
frameworks like TensorFlow and PyTorch are primarily used from the
Python programming language, the user should have a certain
level of familiarity with the language.


Software Installations


Machine Learning
Frameworks


2.6.2 Data Processing,
Model Building, and
Evaluation


It is generally recommended that all the required packages be 
installed in a virtual environment. This can be easily managed by 
any environment manager like Conda (https://docs.conda.io/en/ 
latest /). 


As mentioned earlier, multiple machine learning frameworks are 
available with active development and extensive community sup- 
port. scikit-learn (https://scikit-learn.org/stable/), TensorFlow 
(https://tensorflow.org), Theano (http://deeplearning.net/soft 
ware/theano/), and PyTorch (https://pytorch.org) are some of 
the machine learning and deep learning frameworks. 


As described earlier, there are multiple methods and sources for
obtaining and processing protein-protein interaction data. For
instance, PROFEAT
(http://bidd.group/cgi-bin/profeat2016/ligand/profnew.cgi) is a
commonly used web server for calculating physicochemical and
structural features from a given protein sequence. Similarly, Propy
(https://pypi.org/project/propy3/1.0.0a2/) and iFeature are Python
packages that can compute a large number of sequence features
(amino acid compositions, dipeptide compositions, Moran
autocorrelation descriptors, Geary autocorrelation descriptors, and
composition, transition, and distribution descriptors, among
others). A general overview of the various steps in an entire
ML-/DL-based PPI prediction workflow is summarized in Fig. 6.
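
To make the simplest of the sequence descriptors listed above concrete, the following sketch computes amino acid and dipeptide compositions directly in plain Python. It is only an illustration and does not reproduce the PROFEAT, Propy, or iFeature interfaces; the function names and the choice to concatenate the descriptors of the two partner proteins are assumptions of this example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Fraction of each standard residue in the sequence (20 values)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return {aa: seq.count(aa) / len(seq) for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Fraction of each ordered residue pair among adjacent pairs (400 values)."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    n_pairs = len(seq) - 1
    for i in range(n_pairs):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with non-standard symbols are skipped
            counts[pair] += 1
    return {pair: c / n_pairs for pair, c in counts.items()}


def pair_features(seq_a, seq_b):
    """Feature vector for a candidate protein pair (hypothetical layout):
    descriptors of the first protein followed by those of the second."""
    vector = []
    for seq in (seq_a, seq_b):
        vector.extend(amino_acid_composition(seq).values())
        vector.extend(dipeptide_composition(seq).values())
    return vector
```

For example, `pair_features("MKVLA", "ACCA")` returns a vector of 2 x (20 + 400) = 840 numbers that could be passed to any of the classifiers discussed in this chapter.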

Annotated protein-protein interaction data is retrieved and
processed from openly available databases or novel experimental
sources. The processed data is then divided into training,
validation, and test sets, generally in 80/10/10 proportions,
although this ratio can be altered depending on the size of the
dataset. Assuming that the entire dataset follows the same
underlying distribution, this split can be performed randomly.
Because a single random split can still give misleading results,
cross-validation techniques are also employed.
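
The random 80/10/10 split described above can be sketched as follows. The example also standardizes the features with statistics estimated on the training portion only, anticipating the scaling step discussed in the next paragraph. The toy feature table, the seeds, and the helper names are assumptions for illustration.

```python
import random
import statistics


def split_indices(n, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle 0..n-1 and cut it into train/validation/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


def fit_standardizer(rows):
    """Column-wise mean and standard deviation of the given (training) rows."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]  # guard constant columns
    return means, stds


def standardize(rows, means, stds):
    """Apply (x - mean) / std column by column."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical feature table: 100 protein pairs, two features on very different scales.
rng = random.Random(1)
X = [[rng.uniform(0, 1), rng.uniform(0, 10000)] for _ in range(100)]

train_idx, val_idx, test_idx = split_indices(len(X))
means, stds = fit_standardizer([X[i] for i in train_idx])
X_train = standardize([X[i] for i in train_idx], means, stds)
X_val = standardize([X[i] for i in val_idx], means, stds)    # reuse training statistics
X_test = standardize([X[i] for i in test_idx], means, stds)
```

Estimating the scaling statistics on the training set alone keeps information from the validation and test sets out of the model, which is one common way to avoid optimistic performance estimates.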

Data and labels must be appropriately scaled before training so
that features with disproportionate scales are standardized. This
prevents the accumulation of large weights and the formation of
skewed gradients in the system during the training process.
Generally, a standard scaler (zero mean, unit variance) is
employed for this purpose. Alternatively, min-max scaling or log
transformations are employed, depending on the use case. During
training, hyperparameter optimization plays a critical role in the
model’s overall performance. The set of hyperparameters forms a
search space that is tuned to minimize the error, usually measured
on held-out validation data. In addition, architectures such as CNNs
have further hyperparameters, like filter size and stride length,
that need to be separately optimized, for example, using grid search.

Although most machine learning and deep learning methods
are considered black-box predictors with very little interpretability,
attention mechanisms and utilities such as saliency analysis and
Grad-CAM make it possible to deduce the relative importance of
individual features in the data.


314 Dhvani Sandip Vora et al.


Fig. 6 Overview of a standard machine learning or deep learning implementation. Data retrieval, feature
extraction, feature engineering, and model optimization are some of the major elements of any ML/DL
workflow. [Figure not reproduced. Recoverable panel labels: 1. Feature extraction, 2. Feature engineering;
sequence based (sequence homology, motif/domain based, correlated mutation); structure based (structural
homology, 3D characteristics, depth/protrusion, solvent accessibility, protein docking, surface patches);
physicochemical properties; support vector machines; deep learning (feed-forward neural networks,
multilayer perceptron, recurrent neural nets, convolutional neural network, geometric deep learning,
autoencoders, 3D CNNs)]


2.7. Perspectives


With the advancement of high-throughput techniques in biology,
deep learning (DL) has rapidly become a widely used and powerful
technique for achieving various goals. Judging by the current
popularity of DL in the field of biology, from inferring protein
localization in a cell to predicting binding events, it can be
reasonably expected that DL will dominate the field for the next few
years [165-167]. In addition to proteomics, many other domains have
progressed toward omics scale: transcriptomics, genomics, and
lipidomics, to name a few. Inferred biological networks are usually
incomplete, especially when derived from a single-omics study.
Integrating multi-omics approaches may help decipher biological
complexity across the different layers with improved descriptive
capability. Multiple tools and methods have been designed for the
analysis and integration of multi-omics data, which offer numerous
applications in protein interaction studies along with many other
fields [168].
Recently, natural language processing models have been increasingly
applied in bioinformatics to simplify DNA and protein
sequence-based problems. With amino acid and nucleotide sequences
being interpreted as meaningful sentences, language models allow
the determination of structure, function, and their
interrelationships. However, a fundamental problem in such
applications is how to define biological “words.” Recently
developed techniques aimed at solving this problem include byte
pair encoding (BPE) [169] and the unigram language model (ULM)
[170]. Another tool, SentencePiece, developed to directly generate
words, integrates both the BPE and ULM algorithms [171]. Also, as
chemical structures of biomolecules become increasingly available,
the development of quantum mechanics and quantum machine learning
techniques may significantly affect the future of computational
biology [172-174].
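
As an illustration of how byte pair encoding can derive “words” from sequences, the sketch below repeatedly merges the most frequent pair of adjacent tokens, starting from single residues. It is a simplified toy version written for this example and is not the SentencePiece implementation; the sequences and the number of merges are arbitrary.

```python
from collections import Counter


def most_frequent_pair(corpus):
    """Most frequent pair of adjacent tokens over all tokenized sequences."""
    pairs = Counter()
    for tokens in corpus:
        pairs.update(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(tokens, pair):
    """Replace every occurrence of the adjacent pair by one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Return the learned merge rules and the re-tokenized sequences."""
    corpus = [list(seq) for seq in sequences]  # start from single residues
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(corpus)
        if pair is None:
            break
        merges.append(pair)
        corpus = [merge_pair(tokens, pair) for tokens in corpus]
    return merges, corpus


merges, tokenized = learn_bpe(["MKVLMKVA", "MKVG"], num_merges=2)
```

After two merges on these toy sequences, the recurring fragment MKV has become a single token, which is the sense in which BPE builds a vocabulary of sequence “words.”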

Advances in computational methods to infer protein interactions
accelerate the discovery of molecular mechanisms of cellular
pathways and diseases and subsequently promote drug discovery
and novel diagnostic methods. This chapter covers traditional and
recent protein interaction network discovery and prediction
methods, briefly mentioning the challenges associated with them.
With proteomics research entering the multidisciplinary integration
stage, deep learning techniques will continue to be integrated
into computational proteomics.
Machine Learning Methods for Survival Analysis
with Clinical and Transcriptomics Data of Breast Cancer


Le Minh Thao Doan, Claudio Angione, and Annalisa Occhipinti 


Abstract 


Breast cancer is one of the most common cancers in women worldwide and causes an enormous number
of deaths annually. However, early diagnosis of breast cancer can improve survival outcomes, enabling
simpler and more cost-effective treatments. The recent increase in data availability provides unprecedented
opportunities to apply data-driven and machine learning methods to identify early-detection prognostic
factors capable of predicting the expected survival and potential sensitivity to treatment of patients, with the
final aim of enhancing clinical outcomes. This tutorial presents a protocol for applying machine learning
models in survival analysis for both clinical and transcriptomic data. We show that integrating clinical and
mRNA expression data is essential to explain the multiple biological processes driving cancer progression.
Our results reveal that machine-learning-based models such as random survival forests, gradient boosted
survival model, and survival support vector machine can outperform the traditional statistical methods, i.e.,
the Cox proportional hazards model. The highest C-index among the machine learning models was recorded
when using the survival support vector machine, with a value of 0.688, whereas the C-index recorded using the
Cox model was 0.677. Shapley Additive Explanation (SHAP) values were also applied to identify the feature
importance of the models and their impact on the prediction outcomes.


Key words Breast cancer, Machine learning, Survival analysis, Data integration, Interpretability 


1. Introduction 


Breast cancer is a leading cause of cancer-related deaths worldwide
[1]. According to the latest report published by Cancer Research
UK, breast cancer accounts for 15% of the annual new cancer cases
and 7% of all cancer mortality in the UK [2]. With advancements in
medical treatment and research, the overall survival rate has nearly
doubled in the last 40 years, with around 78% of the patients
surviving more than 10 years [2]. The survival rate after 5 years for
early diagnosed patients varies between 90% and 99%, while this rate
sharply drops to only 28% for late diagnosed patients [3]. Therefore,
early detection and treatment are crucial for breast cancer patients,
as malignant cells tend to metastasize in later phases [4].
Clinical data has often been used to develop clinical prediction
models and gain disease insights [5]. More recently, with the
advancement of high-throughput sequencing technology, extensive
omics data, including genomics, transcriptomics, proteomics, and
metabolomics data, have been produced, together with methods for
their integration [6-8]. The study of multi-omics data makes it
possible to investigate the relationships, roles, and actions of the
various types of molecules constituting the cells of an organism and
to gain a comprehensive understanding of the biological system under
examination. Information from omics data can be used to identify
diagnostic and prognostic markers and support the development of
personalized treatments [9]. Many studies have used omics data
to develop accurate prognostic models for different cancer types
[10-12], achieving more precise predictions than conventional
clinical methods.

Following the breakthrough in exploring omics data, multiple
assays from the same set of instances have recently been
consolidated to generate multi-omics data. Their availability has
reshaped the biological and medical fields by opening avenues for
system-level integration strategies. Multi-omics integration has
been used with great success to understand cancer and other disease
progression mechanisms, to eventually obtain patient-specific
clinical treatments and prevention strategies [13-15]. In fact,
developing algorithms able to process multi-omics data could provide
a sharper view of biomolecules from different layers, pave the way
toward large-scale cell optimization, and facilitate the
understanding of complex biological processes involved in cancer
progression [16, 17]. Hence, compared to single-omics data, models
using multi-omics data and mechanism-based approaches can provide a
deeper understanding of cancer progression and related mechanisms,
including discovering novel biomarkers, studying the interaction
with viruses, and detecting cancer subtypes [18-21].

Most survival-based molecular models have mainly used a single
type of omics data [22]. However, recent investigations found
that a proper combination of clinical and omics survival data could
significantly improve clinical outcomes [5]. This integration usually
outperforms the models that rely only on clinical or omics data
[23]. Hence, it is necessary to investigate the effect of using
different types of data, such as clinical data, omics data, and their
integration, on the performance of the predictive models.

Recently, machine learning (ML) models have been successfully 
developed to process biomedical data, including characterization of 
cell phenotype, detection of cancer, and prediction of survival out- 
comes [24-27]. Specifically, ML has been widely applied in clinical 
diagnosis and medical image analysis to develop computer-aided 
diagnosis systems [28]. The volume and variety of clinical and 
genomic data collected from patients are significantly increasing, 
revealing novel opportunities to apply ML and generate more 
insights into the molecular investigation of tumors and cancer
prognosis. ML methods have facilitated the development of a
more precise landscape of tumor heterogeneity and contributed
to precision oncology. This allows specific patients to have an
individual treatment plan based on a personalized diagnostic and
prognostic risk profile. However, precise diagnosis and treatment of
breast cancer still represent one of the main challenges in
healthcare [29]. Hence, developing accurate prognostic methods is 
necessary to significantly improve risk stratification after diagnosis 
and increase survival expectation. In order to achieve this, several 
patient-specific techniques have been proposed, either relying on 
clinical records, biological markers, or their combinations 
[30, 31]. However, there is still a need to identify the key biomar- 
kers affecting cancer progression and survival outcomes in order to 
develop more accurate personalized treatments. 

Survival analysis is a reliable and widely applied statistical
technique among prognostic modeling methods, which attempts to
evaluate the probability that an event occurs within a specific time
[32]. The prediction outcomes of this type of analysis, such as
cancer death or recurrence, are fundamental to numerous clinical 
judgments in oncology and play an essential role for patients, 
doctors, and scientists [33]. Among the currently available survival 
analysis models, the Cox proportional hazards (CPH) regression 
model is the most widely applied approach to investigate the effect 
of the input features on the survival time of the patients 
[34, 35]. So far, numerous prognostic models have been proposed 
to apply the CPH regression model on clinical and transcriptomic 
data [36] and multi-omics data [37]. More recently, ML has been
successfully applied in the medical and healthcare
fields. Many ML models have been employed in cancer survival
analysis because of their ability to handle high-dimensional data, 
non-linear relationships, and interaction effects [38, 39]. ML-based 
approaches for survival analysis, such as random survival forests 
[40], gradient boosted survival model [41], survival support vector 
machine [42], Cox-nnet [43], and SALMON [44], have empha- 
sized the feasibility of accurately predicting cancer outcomes using 
clinical and omics data. 

Although survival analysis is widely applied in clinical studies,
its prediction in practice still relies heavily on the subjective
interpretation of the clinician, limiting reproducibility and accuracy
[45]. Therefore, this tutorial investigates breast cancer survival
analysis by proposing a framework based on ML algorithms to
perform survival analysis using clinical and transcriptomic data.
The CPH model and three ML-based models, namely random survival
forests (RSF), gradient boosted survival model (GBS), and survival
support vector machine (SSVM), are implemented and tested on
the METABRIC dataset [46]. Our objectives are to classify the
patients into risk groups (i.e., high risk and low risk) and unveil
the prognostic predictors impacting the survival outcomes of
patients. Consequently, identifying the patient risk groups could
assist doctors in determining the course of treatment, promoting
effective therapies, and supporting personalized clinical
decision-making and recommendation.

Hence, the aim of our work is twofold: (1) to present a protocol
for applying ML algorithms in survival analysis, in which
elements of the study design, experiment process, and performance
evaluation criteria are described and outlined so that the protocol
can be generalized and adapted to other publicly available clinical
and transcriptomic data, and (2) to uncover critical prognostic
factors affecting the survival likelihood of breast cancer patients
by employing the most recent statistical techniques for the
interpretability of ML models (i.e., SHAP values) [47].


2 Backgrounds


This section presents the main methodologies and ML algorithms
applied in our tutorial. The main differences between the three ML
algorithms applied for survival analysis are also discussed.


2.1 Survival Analysis


Survival analysis is a statistical procedure applied for analyzing the 
expected duration of time until the occurrence of an event of 
interest (e.g., death or disease recurrence). One of the main
challenges associated with survival analyses consists of dealing with
censored data, a form of missing information that occurs because
of limited observation time, withdrawal from observation, or loss to
follow-up during the study period [48]. Censored data can be
classified into two groups: left-censored and right-censored data. 
The former occurs when the event has already occurred before the 
beginning of the study, while the latter occurs when the survival 
time is only known to exceed a certain value, but the exact time is 
unknown. Right-censored data is the most common type of cen- 
sored data [48]; therefore, this chapter will focus on the survival 
analysis for right-censored data. 

For a given instance $i$, the survival information associated with
$i$ comprises two elements: a binary event indicator $E_i$, in which
$E_i = 0$ for a censored instance and $E_i = 1$ if the event (e.g.,
death) is observed, and a failure event time $T_i$, a non-negative
random variable representing the duration between the beginning of
the study and the occurrence of the event. The formula below reports
the probability of observing the event by time $t$:

$$F(t) = \Pr[T \le t]. \tag{1}$$

The function $F(t)$ is defined as the cumulative distribution
function.
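
A small simulation can make the pair (event indicator, observed time) more tangible. In the sketch below, latent failure times are drawn from an exponential distribution and every instance still event-free at the end of a fixed study period is right-censored. The exponential model, the rate, the study length, and the function names are assumptions chosen for illustration.

```python
import math
import random


def simulate_right_censored(n, event_rate=0.05, study_length=36.0, seed=0):
    """Return (observed_time, event_indicator) pairs.

    event_indicator is 1 when the event is observed within the study
    and 0 when the instance is right-censored at study_length.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        failure_time = rng.expovariate(event_rate)  # latent time to event
        if failure_time <= study_length:
            records.append((failure_time, 1))
        else:
            records.append((study_length, 0))  # only known: time exceeds study_length
    return records


def empirical_cdf(records, t):
    """Fraction of instances whose event was observed by time t.

    Valid as an estimate of F(t) here only because all censoring
    happens at the end of the study (t must not exceed study_length).
    """
    return sum(1 for time, event in records if event == 1 and time <= t) / len(records)


records = simulate_right_censored(20000)
estimate = empirical_cdf(records, t=12.0)
theory = 1.0 - math.exp(-0.05 * 12.0)  # F(t) of the exponential model
```

With censoring that can occur at any time during follow-up, this simple fraction would be biased, which is why dedicated survival estimators and models are needed.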


Let the probability density function be denoted as $f(t)$. The
survival function, $S(t)$, provides the probability that the event is
observed after time $t$, and it is defined as

$$S(t) = \Pr[T > t] = 1 - F(t) = \int_t^{\infty} f(x)\,dx. \tag{2}$$

The hazard function $h(t)$ represents the instantaneous rate at
which the event happens within the interval $[t, t + dt)$, given
that it did not occur before time $t$. Thus, a lower hazard
corresponds to a greater chance of survival. The hazard function
$h(t)$ is defined as

$$h(t) = \lim_{dt \to 0} \frac{\Pr[t \le T < t + dt \mid T \ge t]}{dt}. \tag{3}$$

By using the definition of $S(t)$ in Eq. 2, the hazard function can
also be written as

$$h(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt} \log S(t). \tag{4}$$

Survival and hazard functions are two fundamental concepts in
survival analysis, and they are connected by the expression below:

$$S(t) = \exp\left(-\int_0^t h(x)\,dx\right). \tag{5}$$

Equation 5 can be derived by integrating the first and last terms
in Eq. 4 from 0 to $t$, using $S(0) = 1$. The integral inside the
parentheses in Eq. 5 describes the sum of the risks of observing the
event between time 0 and time $t$. This quantity is called the
cumulative hazard, and it is defined as $H(t) = \int_0^t h(x)\,dx$.


2.2 Cox Proportional
Hazards Model


The Cox proportional hazards (CPH) model [49] has been the
most commonly applied method in clinical studies to investigate
the relationships between time-to-event or survival-time outcomes
and explanatory variables. The CPH model is a regression approach
used to calculate the hazard ratio (HR) and its confidence interval
between patients belonging to different risk groups. Specifically,
the HR can be interpreted as a relative risk. The CPH model is a
semi-parametric model, and it is denoted by the hazard function
$h(t)$ representing the hazard at time $t$, defined as

$$h(t) = h_0(t)\, e^{\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p}, \tag{6}$$

where $h_0(t)$ is the baseline hazard function, and $\beta_1, \beta_2, \ldots, \beta_p$ are the
corresponding regression coefficients of the covariates $x_1, x_2, \ldots, x_p$.
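
The structure of Eq. 6 can be checked numerically with a few lines of code. The sketch below evaluates the hazard for given coefficients, shows that the baseline hazard cancels when two covariate vectors are compared, and uses Eq. 5 to obtain a survival probability under the additional assumption of a constant baseline hazard. The coefficient values, covariates, and baseline value are hypothetical and are not estimates from any dataset.

```python
import math


def cox_hazard(baseline_hazard, betas, x):
    """h(t | x) = h0(t) * exp(sum_i beta_i * x_i) for a given value of h0(t)."""
    return baseline_hazard * math.exp(sum(b * xi for b, xi in zip(betas, x)))


def hazard_ratio(betas, x_u, x_v):
    """Ratio of the hazards of two covariate vectors; h0(t) cancels out."""
    return math.exp(sum(b * (u - v) for b, u, v in zip(betas, x_u, x_v)))


def survival_constant_baseline(h0, betas, x, t):
    """S(t | x) = exp(-H(t | x)), assuming a constant baseline hazard h0."""
    return math.exp(-cox_hazard(h0, betas, x) * t)


betas = [0.7, -0.3]   # hypothetical coefficients
x_u = [1.0, 2.0]      # first covariate one unit higher than in x_v
x_v = [0.0, 2.0]

hr = hazard_ratio(betas, x_u, x_v)                        # equals exp(0.7), about 2.01
s_u = survival_constant_baseline(0.02, betas, x_u, t=12)
s_v = survival_constant_baseline(0.02, betas, x_v, t=12)  # larger than s_u
```

A one-unit increase in a covariate with coefficient 0.7 therefore roughly doubles the hazard at every time point, whatever the baseline hazard is.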

A value of $e^{\beta_i}$ above 1, or $\beta_i$ above zero, shows that an increase
in the value of the $i$th covariate will lead to a rise in the event
hazard and, consequently, a reduction in survival time. In other
words, the covariate is positively correlated with the event
likelihood or negatively associated with the survival time. In
contrast, a value of $e^{\beta_i}$ below 1, or $\beta_i$ below zero, shows that an
increase in the value of the $i$th covariate will lead to a decreased
probability of observing the event. If $e^{\beta_i}$ is equal to one, that
covariate does not affect the survival probability. Overall,
observing $e^{\beta_i}$ above one is a bad prognostic indicator in cancer
studies, whereas observing $e^{\beta_i}$ below one is a good indicator.

Let us consider two observations $u$, $v$ with covariates $x_{ui}$ and
$x_{vi}$, $i = 1, \ldots, k$, and hazard functions defined as

$$h_u(t) = h_0(t)\, e^{\beta_1 x_{u1} + \beta_2 x_{u2} + \cdots + \beta_k x_{uk}}, \tag{7}$$

$$h_v(t) = h_0(t)\, e^{\beta_1 x_{v1} + \beta_2 x_{v2} + \cdots + \beta_k x_{vk}}. \tag{8}$$

Using the definitions of $h_u(t)$ and $h_v(t)$ in Eqs. 7 and 8, the HR
for the two observations $u$, $v$ is calculated as

$$\mathrm{HR} = \frac{h_u(t)}{h_v(t)} = \frac{h_0(t)\, e^{\sum_{i=1}^{k} \beta_i x_{ui}}}{h_0(t)\, e^{\sum_{i=1}^{k} \beta_i x_{vi}}} = e^{\sum_{i=1}^{k} \beta_i (x_{ui} - x_{vi})}. \tag{9}$$

Since the HR is not a function of time $t$, the ratio between the
hazards of the two groups must remain constant through the whole
study, and their hazard curves should not cross. In fact, the CPH
model is based on two assumptions: (1) the survival curves for two or
more strata must have proportional hazard functions over time $t$ and
(2) each covariate makes a linear contribution to the model (on the
log-hazard scale).


2.3 Machine
Learning Models


ML models employed for survival analysis have recently received 
increasing interest due to their promising applications in cancer 
research [39]. They are mainly applied to predict survival outcomes 
and the corresponding survival likelihood following statistical sur- 
vival analysis approaches. However, rather than focusing on survival 
curves estimation, ML approaches mainly focus on predicting the 
time-of-event occurrence by merging the traditional statistical sur- 
vival analysis techniques with the most recent statistical models. 
The advantages of using ML algorithms to perform survival analysis 
include the opportunity of providing more accurate solutions 
allowing the analysis of survival data while dealing with the statisti- 
cal challenges associated with high-dimensional data. 

In this tutorial, patient-specific survival risk probabilities are 
predicted using the most recent ML algorithms, including random 
survival forests, gradient boosting model, and survival support 
vector machine, which have recently become popular due to their 
effectiveness in handling survival data [39]. 


2.3.1 Random Survival 
Forests 
Random survival forests (RSF) is a random forest-based learning
method used to analyze right-censored survival data [50]. The
model uses an ensemble approach to generate predictions by
integrating the estimations of multiple trees. This allows the model
to obtain more precise predictions than using a single tree. The
algorithm employs tree-structured and bagging algorithms, typical of
the random forest model [51], based on the three steps below:


1. A random bootstrap sample from the training set is selected to
grow a tree.


2. The tree nodes are split using a random selection of attributes
rather than all the features available in the dataset.


3. The prediction of the random forest algorithm is determined
by averaging the predictions of the individual trees.


Consequently, each tree in the forest is grown on an independent
bootstrap sample extracted from the training data. Because the
trees are also built on different random subsets of features, they
are less correlated with one another, which reduces the variance
observed when using a single decision tree and yields better
predictive performance. This technique aggregates different trees’
decisions, and it often offers a better generalization. The random
forest algorithm has been demonstrated to be a widely adopted and
effective ML technique for high-dimensional data, and it is
regarded as one of the most successful ensemble methods [52].
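
The bootstrap step behind bagging can be illustrated in a few lines. Drawing n observations with replacement leaves each observation out with probability (1 - 1/n)^n, which approaches exp(-1), i.e., a little over one third, for large n. The sample size, number of repetitions, and seed below are arbitrary choices for this sketch.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n indices with replacement from range(n)."""
    return [rng.randrange(n) for _ in range(n)]


def out_of_bag(n, in_bag):
    """Indices that never appear in the bootstrap sample."""
    chosen = set(in_bag)
    return [i for i in range(n) if i not in chosen]


rng = random.Random(0)
n = 2000
oob_fractions = []
for _ in range(50):  # 50 bootstrap samples, one per hypothetical tree
    sample = bootstrap_indices(n, rng)
    oob_fractions.append(len(out_of_bag(n, sample)) / n)

mean_oob = sum(oob_fractions) / len(oob_fractions)
expected = (1 - 1 / n) ** n  # close to exp(-1), about 0.368
```

The left-out observations of each tree can serve as a built-in validation set for that tree, since they played no role in growing it.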

RSF extends the above approach by integrating censored
information from survival data into the splitting rules applied for
the growth of the forest. RSF is one of the most powerful and widely
used learning algorithms for survival analysis. Each split in a
survival tree employs the log-rank splitting rule, i.e., the split
that maximizes the log-rank test statistic is chosen. Other splitting
rules, such as log-rank score or conservation of events, can be used
during the growing phase of the forest. However, log-rank splitting
is the most popular technique, and it is the one used in this
tutorial.
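
The log-rank splitting rule can be sketched for a single feature as follows. For each candidate threshold, the observations are divided into a left and a right child, and the standardized log-rank statistic comparing the two children is computed; the threshold with the largest statistic is retained. This is a simplified illustration that ignores minimum node sizes and random feature selection, and the toy data and function names are assumptions of this example.

```python
import math


def logrank_statistic(times, events, in_left):
    """Absolute standardized log-rank statistic for a left/right split."""
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n = n_left = d = d_left = 0
        for ti, e, left in zip(times, events, in_left):
            if ti >= t:  # still at risk just before time t
                n += 1
                n_left += left
                if ti == t and e == 1:  # event occurring at time t
                    d += 1
                    d_left += left
        observed_minus_expected += d_left - d * n_left / n
        if n > 1:
            variance += d * (n_left / n) * (1 - n_left / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0
    return abs(observed_minus_expected) / math.sqrt(variance)


def best_split(times, events, feature):
    """Threshold of one feature that maximizes the log-rank statistic."""
    best_threshold, best_stat = None, 0.0
    for threshold in sorted(set(feature))[:-1]:  # keep the right child non-empty
        in_left = [x <= threshold for x in feature]
        stat = logrank_statistic(times, events, in_left)
        if stat > best_stat:
            best_threshold, best_stat = threshold, stat
    return best_threshold, best_stat


# Toy node: low feature values go with short survival times (0 = censored).
feature = [1, 2, 3, 4, 5, 6, 7, 8]
times = [2, 3, 4, 5, 20, 22, 25, 30]
events = [1, 1, 1, 1, 1, 0, 1, 0]

stat_at_4 = logrank_statistic(times, events, [x <= 4 for x in feature])
threshold, stat = best_split(times, events, feature)
```

In an actual forest, this search would be repeated over the randomly chosen candidate predictors at every node, subject to the stopping rule on the number of events per node.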

According to Ishwaran et al. [50], the description of the RSF 
algorithm can be summarized as below: 


1. The number of trees $n$ in the forest and the number of
predictors $k$ for the splitting of each node are defined.


2. $n$ bootstrap samples are drawn from the data. Each sample
excludes the out-of-bag data, which can be proven to be
approximately equal to 37% of the full dataset [53].


3. A survival tree is grown on each bootstrap sample using the
following approach:


• $k$ candidate predictor variables are randomly chosen.


• For each possible splitting point of each of the $k$ candidate
predictors, the log-rank statistic is computed.


• The node is split based on the log-rank splitting rule that
maximizes the survival difference between children nodes.


• The tree continues to grow to full size under the constraint
that the number of event observations (e.g., deaths) in each
node is greater than a predefined minimum terminal
node size.


4. A cumulative hazard function is computed for each tree. Then,
the results are averaged to estimate the ensemble cumulative
hazard function for all trees.


5. Harrell’s concordance index [54] is calculated on the out-of-bag
data and used to determine the predictive accuracy of the
model.


2.3.2 Gradient Boosted
Survival


Gradient boosted survival analysis (GBS) is a gradient boosting 
machine learning model applied to analyze censored data 
[55]. The predictive algorithm is based on an additive regression 
model of sequentially fitted weak learners (base learners) while 
minimizing the loss function. It is thus regarded as an ensemble 
learning method. It is a nonparametric approach and does not 
require any functional form assumption, providing researchers 
with more flexibility than other survival models. GBS also generates
more robust predictions than a single learner, as it consolidates
the predictions of several weak learners.

In GBS, each successive tree is an enhancement over the previ- 
ous one. In other words, the second tree improves over the first tree 
by learning from the residual of its prediction, while the third tree 
enhances over the first and second ones and so on. The outcome is 
estimated by the weighted sum of all the predicted values given by 
the individual trees. 
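
The idea that each tree learns from the residual of the previous ones can be sketched with a deliberately simplified setting: one numeric feature, depth-one regression trees (stumps), and a squared-error loss, for which the negative gradient is simply the residual. This is not the survival-specific loss used later in the tutorial; the data, the learning rate, and the function names are assumptions for illustration.

```python
def fit_stump(x, residuals):
    """Least-squares regression stump (one threshold) on a single feature."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_trees=20, learning_rate=0.5):
    """Boosting with squared-error loss; returns a predictor and the loss history."""
    f0 = sum(y) / len(y)  # initial guess: the mean of the targets
    predictions = [f0] * len(y)
    trees, losses = [], []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]  # negative gradient
        tree = fit_stump(x, residuals)
        trees.append(tree)
        predictions = [pi + learning_rate * tree(xi) for xi, pi in zip(x, predictions)]
        losses.append(sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y))

    def predict(xi):
        return f0 + learning_rate * sum(tree(xi) for tree in trees)

    return predict, losses


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
predict, losses = gradient_boost(x, y)
```

The loss history decreases from one iteration to the next, and the final prediction is the initial guess plus the weighted sum of the contributions of all stumps.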

The gradient boosting algorithm can be summarized as below
[56, 57]:


1. The number of iterations $M$, the base learner model $h(x, \theta)$, and
the loss function $L(y, f)$ are defined, where $(x_i, y_i)_{i=1}^{N}$ is the input
data, $\theta$ are the parameter estimates, and $f$ is the unknown
function that maps the input variables $x$ to the target variables $y$.


2. An initial random guess $f_0$ of the unknown function $f$ is
defined.


3. For each iteration $k$, the following steps are performed:


• The negative gradient of the loss function at iteration $k$ is
calculated.


• The new base learner function $h(x, \theta_k)$ is fitted.


• The best gradient descent step size $\rho_k$ is estimated as follows:

$$\rho_k = \arg\min_{\rho} \sum_{i=1}^{N} L\left[y_i, f_{k-1}(x_i) + \rho\, h(x_i, \theta_k)\right]. \tag{10}$$


• The estimated function $f_k$ is updated as
$f_k = f_{k-1} + \rho_k h(x, \theta_k)$.


4. The output of the final model is defined as
$f(x) = f_M(x) = f_0(x) + \sum_{k=1}^{M} \rho_k h(x, \theta_k)$.


In this tutorial, we use regression trees as the base learner
model and the CPH loss as the loss function in the GBS [58]. By
doing this, the hazard, survival function, and log-hazard ratio are
estimated by summing up the predictions of the regression trees.


2.3.3 Survival Support
Vector Machine


Support vector machine (SVM) is a very popular supervised
learning method for regression and classification problems. SVM
has also been applied to censored data for survival analysis
[59]. The central idea of SVM is to classify data points by
maximizing the margin between groups in a high-dimensional space and
finding a separating hyperplane that minimizes misclassification
[60]. The hyperplane separates the classes and is as far from the
closest observations as possible. The support vectors are then
defined as the data points nearest to the maximum margin hyperplane.

Survival support vector machine (SSVM) follows the same
approach as SVM, but it employs an asymmetric penalty function
to handle survival data. Specifically, linear SVM can be adapted to
solve survival analysis by ranking, regression, and hybrid
approaches. In the ranking approach, the learning model assigns a
lower rank to instances with a shorter time to event by examining
all possible pairs of instances in the training data, whereas in the
regression approach, the model predicts the exact survival times.
Because of its efficiency and performance, this work focuses
on linear SSVM to handle survival analysis problems. We apply a
more efficient SVM algorithm called FastSVM [61]. This model has
lower computational training costs since it is based on truncated
Newton optimization and order statistic trees.


2.4 Feature Selection

When working with transcriptomic data, the number of features
often significantly exceeds the number of observations, leading ML
algorithms to overfit the data and report poor performance. For
this reason, several feature selection techniques have been
proposed, with the aim of identifying and selecting an optimal subset
of features. The most widely applied feature selection methods
include Pearson correlation, Spearman correlation [62], principal
component analysis (PCA) [63], and genetic algorithm
(GA) [64]. However, Schemper et al. [65] have shown that the
Pearson and Spearman correlation models are unsuitable for
censored data. Besides, dimensionality reduction methods,
such as PCA, are difficult to interpret and are more suitable
for linear or approximately linear high-dimensional data [66],
while GA-based wrapper techniques have low computational
efficiency [67]. Maximum relevance and minimum redundancy
(mRMR) [68], a technique applied to select features based on
their correlation with the response variables, has the advantage of
faster computation and stronger robustness than the above feature
selection techniques. Hence, mRMR is applied in this tutorial.
According to Peng et al. [68], the model ranks the features
according to both their relevance to the outcome and the low
correlation between themselves. The steps performed by the mRMR
algorithm are described below. First, mRMR identifies the first
feature based on the maximum relevance value.

Let I be the mutual information (MI) to measure both rele- 
vance and redundancy between features. The MI of two random 
features m and 7 is given by 


p(m,n) 

I(m, n) J Spm, n) log 2(m) pn) dm dn, (11) 
where p(m), p(m), and p(m, m) denote the probabilistic density 
function of m, m and their joint probabilistic density function, 
respectively. 

Next, let X denote the whole feature set, S the selected feature
set containing s features, and c the outcome class.
For an individual feature x_i, I(x_i, c) denotes its MI with the class c.
The maximum relevance criterion, reflecting the largest dependence
of x_i on the target class c, is computed by

$$\max D(S, c), \quad D(S, c) = \frac{1}{|S|} \sum_{x_i \in S} I(x_i, c). \tag{12}$$

The features selected by the maximum relevance criterion are
likely to have large dependency among them. Hence, the minimum
redundancy condition is added and calculated by

$$\min R(S), \quad R(S) = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i, x_j), \tag{13}$$

where I(x_i, x_j) is the MI of features x_i and x_j.

The final mRMR feature set is chosen by simultaneously optimizing
Eqs. 12 and 13. An incremental search approach is used to
find the near-optimal features. Let us consider the feature set S_{s-1}
with s-1 features already identified. The sth feature is selected
from the remaining feature set {X - S_{s-1}} by optimizing the following
condition:

$$\max_{x_j \in X - S_{s-1}} \left[ I(x_j, c) - \frac{1}{s-1} \sum_{x_i \in S_{s-1}} I(x_j, x_i) \right]. \tag{14}$$

3 Methods

This section describes the dataset used in this tutorial and the
experiments run to perform survival analysis. We conducted three
experiments to investigate the performance of CPH- and
ML-based models on clinical data, transcriptomic data, and the
integration of the two data types. First, we describe the dataset;
then, the study design, initial setting, and details
of the three experiments are discussed.


3.1 Dataset

The METABRIC dataset [46] was used to assess the predictive performance
of the CPH model and the ML methods implemented in
this chapter. The dataset was downloaded from cBioPortal
(www.cbioportal.org/datasets). The tumor information in the
original METABRIC study was collected from five centers in the
UK and Canada. The objective of the study was to analyze the effect
of genomic and transcriptomic profiles on breast cancer survival to
discover the optimal treatment approach for patients. The dataset
contains clinical information for 2509 primary breast cancer samples
and 2509 molecular profiles, including 1904 transcriptomic
profiles, with a maximum follow-up period of 355 months. Clinical
data was obtained from cohort studies and trials, including the
survival time in months and status (deceased or censored), while
gene expression data was extracted from mRNAseq, which provides
a snapshot of the abundance of the different gene transcripts
in the cell. The detailed description of tissue specimens and staging
can be found in the original METABRIC study of Curtis et al.
[69]. To explore the power of CPH and ML models, we considered
all the clinical and transcriptomic covariates available in the dataset.


3.2 Study Design

We set up three experiments to evaluate the CPH and ML models
for survival analysis, using (1) clinical data, (2) transcriptomic
data, and (3) integrated clinical and transcriptomic data. The Python
programming language (version 3.8.8) and its libraries in the
Anaconda environment (version 4.10.3) were used to conduct the
experiments. Python 3 can be run on any popular operating system
such as Windows, Mac, and Linux. However, the steps in this
chapter are demonstrated on a Windows 10 Pro 64-bit operating
system.

We implemented each experiment separately in Jupyter notebooks,
an open-source web application that integrates code, visualizations,
computational output, and other resources in one single
file. However, the code can also be run online on Google
Colab (https://colab.research.google.com) without any software
installation.

3.3 Initial Setting


The CPH and the current state-of-the-art ML algorithms pre- 
sented in Subheading 2 were applied and evaluated in this chapter 
using scikit-survival packages [70] for Python. 

The procedure of the experiments followed in this tutorial is 
described below and presented in Fig. 1: 


Step 1: The libraries and packages required for the analysis are
first installed and imported.

Step 2: The METABRIC dataset is loaded.

Step 3: Preprocessing steps and data exploration techniques are
performed to investigate the dataset.

Step 4: Feature selection is applied to select the optimal number
of features from a large set of variables (this step is applied
only to the transcriptomic data, where data reduction is
necessary to improve the performance of models).

Step 5: The CPH model is run, and the results are plotted and
interpreted.

Step 6: ML algorithms are set up and run to generate the final
predictive models.

Step 7: Results are interpreted using SHAP values, and models are
compared.


The outputs of the survival ML models are patient-specific risk
scores, which incorporate OS time and the corresponding event
censorship indicator. A higher risk score indicates a greater likelihood
of observing the event of interest (e.g., death) early. Therefore,
it is necessary to find an appropriate metric to evaluate the
performance of models based on such predicted risk scores.

Harrell’s concordance index (C-index) [54], a goodness-of-fit measure
for survival models, estimates the concordance probability
P(η_j > η_i | T_i > T_j) for two instances i and j, i.e., the rank association
between their OS times T_i, T_j and the models’ predictions η_i,
η_j. It assesses the probability that, for a randomly chosen pair of observations, the
patient with the higher risk score is the one with the shorter survival
time. Hence, it estimates how well a model predicts the ordering of
patients’ death times. C-index values range from 0 to 1, where a
value of 0.5 corresponds to a random model with no predictive
discrimination. In contrast, a C-index equal to 1 implies a perfect
ranking of the observed and predicted survival times.
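To make the definition concrete, the sketch below computes the C-index by brute force over all comparable pairs. It is a simplified illustration that ignores ties in survival time; the function name harrell_c_index and the toy values are assumptions made for this example, and the experiments in this chapter rely on concordance_index_censored from scikit-survival instead.

```python
def harrell_c_index(time, event, risk):
    """Brute-force C-index: share of comparable pairs ranked correctly.

    A pair (i, j) is comparable when patient i experienced the event
    and did so before the observed time of patient j.
    """
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a censored patient cannot be the earlier event of a pair
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0   # shorter survival, higher risk
                elif risk[i] == risk[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: survival in months, event flag (1 = death), risk score
time = [5, 10, 15, 20]
event = [1, 1, 0, 1]
print(harrell_c_index(time, event, [4, 3, 2, 1]))  # risks ordered perfectly
print(harrell_c_index(time, event, [1, 2, 3, 4]))  # risks ordered in reverse
print(harrell_c_index(time, event, [1, 1, 1, 1]))  # uninformative risks
```

The three calls illustrate the perfect, fully reversed, and uninformative cases described above.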


Fig. 1 Tutorial workflow. After setting the environment (step 1), clinical and transcriptomic data were retrieved
from cBioPortal (step 2). To start the experiment, we loaded the data, followed by data cleaning and data
exploration steps (step 3). Due to the high-dimensional nature of transcriptomic data, a feature selection
step was required before running the machine learning models (step 4). Next, the CPH model was run, and the
results were plotted to investigate the HR and p-value associated with each risk factor (step 5). Then, we
built, trained, and evaluated the ML models for survival analysis. Patients were then divided into high- and
low-risk groups based on the predicted risk scores, and the survival risk differences between groups were
investigated (step 6). Finally, the top critical prognostic markers were identified and interpreted using SHAP
values (step 7)


Before starting the analysis, folders named Data and Plot
need to be set up on your local machine to store all data and
figures for the experiments. Then, Python 3 [71] needs to be
installed. The software is free and can be downloaded from
www.python.org/downloads/. We recommend using the Anaconda
environment for Python and its libraries to run the experiments
presented in this project.
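The two folders can also be created from within the notebook. The snippet below is a minimal sketch using only the Python standard library; the helper name ensure_project_dirs is an assumption made for this example, and the folders are created in the current working directory.

```python
from pathlib import Path


def ensure_project_dirs(base=".", names=("Data", "Plot")):
    """Create the folders used by the tutorial if they do not exist yet."""
    created = []
    for name in names:
        path = Path(base) / name
        path.mkdir(parents=True, exist_ok=True)  # no error if already present
        created.append(path)
    return created


dirs = ensure_project_dirs()
print([str(p) for p in dirs])
```

Because exist_ok=True is used, rerunning the cell does not fail or overwrite files already stored in the folders.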

The data files used for this project and the complete code
notebooks are available at https://github.com/Angione-Lab/
survival_analysis_tutorial. The repository includes the clinical and
transcriptomic data (i.e., data_clinical_patient.csv,
data_clinical_sample.csv, data_mRNA_median_all_sample_Zscores.csv), which
are required to run the following steps.

After creating a new notebook, a new cell/field to run the
code needs to be created. By clicking on the “Run Cell” button,
the code will be executed cell by cell. Finally, the Python libraries
and packages need to be installed as shown below:


!pip install dataprep # data exploration
!pip install scikit-survival # survival analysis
!pip install lifelines # plotting survival analysis
!pip install git+https://github.com/smazzanti/mrmr # feature selection
!pip install shap # model interpretation


Other than the above packages, some primary data preprocessing
and visualization libraries such as Pandas, NumPy, Matplotlib,
and Seaborn are expected to be installed if the code is run on a local
machine. The syntax !pip install followed by the library name can be used
to install these preliminary packages.


!pip install pandas # loading and preprocessing data
!pip install numpy # loading and preprocessing data
!pip install matplotlib # visualisation
!pip install seaborn # visualisation
!pip install -U scikit-learn # preparing ML algorithms
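Before moving on, it can help to confirm which of these packages are actually visible to the notebook kernel. The following sketch uses only the standard library (importlib.metadata, available from Python 3.8); the helper name installed_versions is an assumption made for this example, and the distribution name mrmr_selection for the mRMR package is also an assumption that may differ depending on how the package was installed.

```python
import sys
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# "mrmr_selection" is an assumed distribution name for the mRMR package
required = ["dataprep", "scikit-survival", "lifelines", "mrmr_selection", "shap",
            "pandas", "numpy", "matplotlib", "seaborn", "scikit-learn"]

print("Python", sys.version.split()[0])
for name, version in installed_versions(required).items():
    print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```

Any package reported as missing can then be installed with the !pip install syntax shown above.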


Once the required libraries are installed, they need to be
imported at the beginning of the notebook to use the relevant
functions.

# Packages to load and preprocess data
import numpy as np
import pandas as pd

# Packages to visualise and explore data
import seaborn as sns
sns.set_style("whitegrid")
import matplotlib.pyplot as plt
from dataprep.eda import (plot, create_report, plot_missing,
                          plot_correlation)

# Feature selection
from mrmr import mrmr_classif

# Packages to prepare data for ML
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

# Packages for survival analysis
from lifelines import CoxPHFitter
from lifelines.utils import k_fold_cross_validation
from lifelines.statistics import logrank_test
from lifelines import KaplanMeierFitter
from lifelines.plotting import add_at_risk_counts

# Packages for ML in survival analysis
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.svm import FastSurvivalSVM
from sksurv.ensemble import RandomSurvivalForest
from sksurv.ensemble import GradientBoostingSurvivalAnalysis
from sksurv.metrics import concordance_index_censored

# Package to interpret data
import shap


3.4 Experiment 1: Clinical Data

Following the workflow presented in Fig. 1, Experiment 1 was
conducted to perform survival analysis on the clinical data. The
data was first loaded into a data frame for data cleaning and exploratory
data analysis (EDA). Then, the CPH model and ML models
were trained and evaluated to predict the survival risk of the
patients. Finally, the results were interpreted to identify the critical
clinical factors associated with low survival of breast cancer
patients.
The details of the experiment are presented in the following
sections.


3.4.1 Load Data

The patient information used in our analysis is stored in two files,
one with clinical information and the other with the demographic
characteristics of the patients. These two files need to be merged
into a single data frame for easy processing.

Fig. 2 First five rows of the merged data frame. The data is presented in a table with the clinical features as
columns and patients as rows. As the data frame comprised many columns, only the first eight columns are
displayed in this figure (PATIENT_ID, LYMPH_NODES_EXAMINED_POSITIVE, NPI, CELLULARITY, CHEMOTHERAPY,
COHORT, ER_IHC, and HER2_SNP6)


# Load data
file1 = pd.read_csv('Data/data_clinical_patient.csv')
file2 = pd.read_csv('Data/data_clinical_sample.csv')

# Merge clinical data
data = pd.merge(file1, file2, how="inner", on=["PATIENT_ID"])
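An inner merge silently drops patients that appear in only one of the two files. As an optional check, the sketch below uses the validate and indicator arguments of pd.merge on two small made-up tables (the patient IDs and values are hypothetical, not taken from METABRIC) to confirm that the key is unique in both tables and to list the unmatched patients.

```python
import pandas as pd

# Hypothetical toy tables standing in for the patient and sample files
patients = pd.DataFrame({"PATIENT_ID": ["MB-0000", "MB-0002", "MB-0005"],
                         "AGE_AT_DIAGNOSIS": [75.65, 43.19, 48.87]})
samples = pd.DataFrame({"PATIENT_ID": ["MB-0000", "MB-0002", "MB-0006"],
                        "TUMOR_SIZE": [22.0, 10.0, 25.0]})

# Outer merge with an indicator column shows where each row came from;
# validate raises an error if PATIENT_ID is duplicated in either table
checked = pd.merge(patients, samples, how="outer", on="PATIENT_ID",
                   validate="one_to_one", indicator=True)
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["PATIENT_ID", "_merge"]])

# The inner merge keeps only the patients present in both tables
merged = pd.merge(patients, samples, how="inner", on="PATIENT_ID",
                  validate="one_to_one")
print(merged.shape)
```

If the list of unmatched patients is empty, the inner merge loses no rows.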


Once the data was loaded and merged, the first five rows of the 
new data frame and its information were extracted to get an over- 
view of the data using the lines below. The outcome is reported in 
Fig. 2. 


# Have a quick look at data
data.head()


Next, an overview of the data frame information can be gener- 
ated by running the lines below. The output is displayed in Fig. 3. 


# Data information
data.info()


The data contained some missing values; hence, it is essential to 
understand the data and preprocess it carefully before implement- 
ing any predictive models. In the next section, different techniques 
to explore and clean the data are performed. 


<class 'pandas.core.frame.DataFrame'>
Int64Index: 2509 entries, 0 to 2508
Data columns (total 36 columns):
 #   Column                          Non-Null Count
 0   PATIENT_ID                      2509 non-null
 1   LYMPH_NODES_EXAMINED_POSITIVE   2243 non-null
 2   NPI                             2287 non-null
 3   CELLULARITY                     1917 non-null
 4   CHEMOTHERAPY                    1980 non-null
 5   COHORT                          2498 non-null
 6   ER_IHC                          2426 non-null
 7   HER2_SNP6                       1980 non-null
 8   HORMONE_THERAPY                 1980 non-null
 9   INFERRED_MENOPAUSAL_STATE       1980 non-null
 10  SEX                             2509 non-null
 11  INTCLUST                        1980 non-null
 12  AGE_AT_DIAGNOSIS                2498 non-null
 13  OS_MONTHS                       1981 non-null
 14  OS_STATUS                       1981 non-null
 15  CLAUDIN_SUBTYPE                 1980 non-null
 16  THREEGENE                       1764 non-null
 17  VITAL_STATUS                    1980 non-null
 18  LATERALITY                      1870 non-null
 19  RADIO_THERAPY                   1980 non-null
 20  HISTOLOGICAL_SUBTYPE            2374 non-null
 21  BREAST_SURGERY                  1955 non-null

Fig. 3 Clinical data information. The figure reports an overview of the clinical data frame, including total
entries, data types, the names of the columns, and the number of non-null data points. There are 2509 entries
and 36 columns in the clinical data frame. The first 21 columns are shown in this figure, which include two
types of data: (1) float or numeric and (2) object or non-numeric. Some columns contained missing values,
such as LYMPH_NODES_EXAMINED_POSITIVE and NPI. This analysis provides a useful summary of the data
before implementing any preprocessing steps


3.4.2 Preprocess and Explore Data

In order to save computation time, duplicate observations and
unused columns were dropped before conducting exploratory
data analysis. This is one of the fundamental data cleaning steps to
prepare the data for further analysis:

• VITAL_STATUS and SAMPLE_ID columns were dropped
  because they reported the same information as OS_STATUS
  and PATIENT_ID, respectively.

• SEX and SAMPLE_TYPE columns had only a single value;
  hence, they did not provide any useful information for the
  predictive models and were removed.

• RSF_STATUS and RSF_MONTHS were derived variables and
  were not used in our survival analysis.


# Drop unused columns: based on data.info(), we drop some unused
# columns and null columns
drop_list = ['VITAL_STATUS', 'SAMPLE_ID', 'SEX', 'SAMPLE_TYPE',
             'RSF_STATUS', 'RSF_MONTHS']
data = data.drop(drop_list, axis=1)


Group Patients by CANCER_TYPE
Breast Cancer     2506
Breast Sarcoma       3
Name: PATIENT_ID, dtype: int64

After the preprocessing, the shape of data is: (2506, 29)

Fig. 4 Output of preprocessing step. Only three breast sarcoma samples were present in the data; therefore,
we dropped those three samples and left only one single value in the CANCER_TYPE column, i.e., normal
breast cancer. As a result, since the CANCER_TYPE column reported the same value for all the samples and
did not add any extra information about the samples, the column was removed and not included in the future
steps of the analysis. Finally, after preprocessing, the final dataset consisted of 2506 samples and 29 features


We also checked the number of patients for each cancer type 
since the target of our study is breast cancer. The dataset included 
some breast sarcoma instances, a sporadic form of breast cancer. 
However, since a normal breast cancer prognosis is our primary 
objective, the data was filtered by CANCER_TYPE to keep normal 
breast cancer only. The lines below show the implementation of the 
filtering steps. The output of these steps is reported in Fig. 4. 


# We check the number of patients by cancer type
print('\nGroup Patients by',
      data.groupby('CANCER_TYPE')['PATIENT_ID'].count())

# There are only three patients with Breast Sarcoma,
# so we keep only the patients with the Breast Cancer type
data = data[data['CANCER_TYPE'] == 'Breast Cancer']

# Delete the cancer type column, as it now reports the same value for
# all the samples and does not bring any useful information for
# the following steps of the analysis
data = data.drop(['CANCER_TYPE'], axis=1)
print('\nAfter the preprocessing, the shape of data is:', data.shape)
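Columns such as SEX, SAMPLE_TYPE, and (after filtering) CANCER_TYPE were identified by inspection as carrying a single value. A minimal sketch of how such columns could be found programmatically is given below; the helper name constant_columns and the toy data frame are assumptions made for this example.

```python
import pandas as pd


def constant_columns(frame):
    """Return the columns with at most one distinct non-missing value."""
    return [col for col in frame.columns
            if frame[col].nunique(dropna=True) <= 1]


# Hypothetical toy data frame
toy = pd.DataFrame({
    "SEX": ["Female", "Female", "Female"],
    "CANCER_TYPE": ["Breast Cancer", "Breast Cancer", "Breast Cancer"],
    "GRADE": [1, 2, 3],
})
print(constant_columns(toy))
toy = toy.drop(columns=constant_columns(toy))
```

Missing values are ignored when counting distinct values, so a column containing one category plus blanks is also reported as constant.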


Before continuing the preprocessing phase (step 3 in Fig. 1),
data was explored to investigate data types, data distribution, and
missing values. The library dataprep was used for exploratory data
analysis (EDA). Other options to explore specific parts of the
report, such as missing values and data distribution, were also
used, as shown in the code below.

# Understand data
# Save the report as an HTML file
create_report(data).save('Plot/EDA_clinical_report')

# Optional: explore parts of the report
plot_missing(data).save('Plot/missing_values.html')
plot(data).save('Plot/data.html')

The library generates an interactive EDA report that can be
exported as an HTML file, as shown in Fig. 5, and opened in a web
browser. This is a comprehensive report presenting all information
about the features in the data frame. Besides the comprehensive
report, the library allows specific parts of the report to be extracted. This
can be achieved by selecting “Missing Values” in the menu at the
top of the report page. For instance, the percentage of missing
values for each variable is illustrated in Fig. 6. Once an overview
of the data had been obtained, the next step was dealing with
missing values. Our strategy was to remove the rows and columns
with more than 50% of the missing values. Figure 6 shows that
there were no columns with more than 50% of the missing values.

Fig. 5 EDA report for clinical data. The report shows that the dataset consists of 29 features (22 categorical
and 7 numerical features) and 2506 rows. There are no duplicate rows, and 10,088 missing values account for
13.9% of the data. Besides, the report also reveals insights for each column (top-right panel), such as the
number of missing values, skewness, unique number of values, and statistical summary. The distribution of
each variable and information about missing values are also provided in the final report (bottom panel)

Fig. 6 Missing values chart. The plot shows the missing values information by column. In the stacked column
chart, the orange section represents the number of blank rows, whereas the blue represents the non-blank
ones. As shown in the chart, no columns have more than 50% of the missing values

We removed the rows with more than 50% of the blank values
by running the lines of code below. The output is reported in the
first two rows of Fig. 7.


# Deal with missing values
# No column has more than 50% missing values
cols_mv_50 = data.columns[data.isnull().mean() > 0.5]
print('Number of columns having more 50% missing data', len(cols_mv_50))

# Remove rows with more than 50% missing values
percent = 50
min_count = int(((100 - percent) / 100) * data.shape[1] + 1)
data = data.dropna(axis=0, thresh=min_count)
print('After removing rows with more than 50% missing value:', data.shape)


Number of columns having more 50% missing data: 0
After removing rows with more than 50% missing values: (1977, 29)
List columns having missing data: Index(['LYMPH_NODES_EXAMINED_POSITIVE',
       'CELLULARITY', 'ER_IHC', 'THREEGENE',
       'LATERALITY', 'HISTOLOGICAL_SUBTYPE', 'BREAST_SURGERY', 'GRADE',
       'TUMOR_SIZE', 'TUMOR_STAGE'],
      dtype='object')
After preprocessing, missing value number: 0

Fig. 7 Output of the steps dealing with missing values. No columns are missing more than 50% of their
values. After removing the rows with more than 50% of the missing values, 1977 rows remain in the
data. Also, 10 columns, namely LYMPH_NODES_EXAMINED_POSITIVE, CELLULARITY, ER_IHC, THREEGENE,
LATERALITY, HISTOLOGICAL_SUBTYPE, GRADE, TUMOR_SIZE, TUMOR_STAGE, and BREAST_SURGERY, contain
missing values, which are replaced by their mode (if categorical) or their average (if numeric). Once the
preprocessing steps are completed, no missing values are found in the dataset
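The thresh argument of dropna is the minimum number of non-missing values a row must contain to be kept, so the 50% rule has to be converted into a count. The arithmetic for the 29 columns of this data frame is worked through below; the helper name min_non_null is an assumption made for this example.

```python
def min_non_null(n_columns, max_missing_percent=50):
    """Minimum number of non-missing values a row needs in order to be kept."""
    return int(((100 - max_missing_percent) / 100) * n_columns + 1)


n_columns = 29
threshold = min_non_null(n_columns)     # int(0.5 * 29 + 1) = int(15.5) = 15
for non_null in (14, 15):
    missing_share = 100 * (n_columns - non_null) / n_columns
    kept = non_null >= threshold
    print(f"{non_null} non-null values, {missing_share:.1f}% missing, kept={kept}")
```

With 29 columns, a row with 15 missing values (51.7%) is removed and a row with 14 missing values (48.3%) is kept, as intended. Note that for an even number of columns the same formula also removes rows with exactly 50% missing values (e.g., 28 columns give a threshold of 15), which is slightly stricter than "more than 50%".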


After removing the rows with more than 50% of the missing
values, we replaced any missing values with their mode (for categorical
variables) and their average (for numeric variables). To
achieve this, first, we identified which columns contained blanks
and classified them into either categorical or continuous numeric
types. Once this step was completed, we checked again the number
of missing values to ensure there were no other missing values in
the data. The output of the code below is displayed in Fig. 7.


# Print names of columns having blanks
cols_missvalue = data.columns[data.isnull().sum() > 0]
print('List columns having missing data:', cols_missvalue)

cat_var = ['LYMPH_NODES_EXAMINED_POSITIVE', 'CELLULARITY', 'ER_IHC',
           'THREEGENE', 'LATERALITY', 'HISTOLOGICAL_SUBTYPE',
           'BREAST_SURGERY', 'GRADE', 'TUMOR_STAGE']
num_var = ['TUMOR_SIZE']

# Replace missing values with most frequent values
data[cat_var] = data[cat_var].fillna(data[cat_var].mode().iloc[0])

# Replace missing values with average values
data[num_var] = data[num_var].fillna(data[num_var].mean())

# Check missing values again
print('Missing value number:', data.isna().sum().sum())


Before moving into the next step of the pipeline, distribution
visuals for each variable were plotted using the line of code below.
The output is reported in Fig. 8.


# Exploring clean data
plot(data.iloc[:, 1:]).save('Plot/preprocessed_data.html')


Fig. 8 Features distribution in clinical data. Excluding PATIENT_ID, the remaining 28 features, including
OS_MONTHS and OS_STATUS, are plotted. The figure shows that LYMPH_NODES_EXAMINED_POSITIVE and
OS_MONTHS are right-skewed, while the CELLULARITY values, i.e., the amount of tumor cells, are mostly
high, followed by moderate and low status


In the following step (step 3 in Fig. 1), some categorical
variables were encoded as numeric values to process and analyze the data.
First, we prepared a list of features/columns to be encoded. Then,
the OrdinalEncoder function was used to transform the columns in
the list into numeric values.

# Encode categorical data

# Encode OS status as a binary indicator (1 = deceased, 0 = censored)
data['OS_STATUS'] = np.where(data['OS_STATUS'] == '1:DECEASED', 1, 0)

# Columns that are not encoded (numeric variables, outcome, patient ID)
other_var = ['LYMPH_NODES_EXAMINED_POSITIVE', 'NPI', 'AGE_AT_DIAGNOSIS',
             'COHORT', 'GRADE', 'TUMOR_SIZE', 'TUMOR_STAGE',
             'TMB_NONSYNONYMOUS', 'OS_MONTHS', 'OS_STATUS', 'PATIENT_ID']
df_encode = data.drop(other_var, axis=1)

# Some variables' values are not in order, so we have to specify the
# variables and their corresponding orders
modified_list = ['CELLULARITY', 'HER2_SNP6', 'INFERRED_MENOPAUSAL_STATE',
                 'INTCLUST', 'THREEGENE']
keep_list = df_encode.columns[~df_encode.columns.isin(modified_list)]
cel_cat = ['Low', 'Moderate', 'High']
her2_cat = ['UNDEF', 'LOSS', 'NEUTRAL', 'GAIN']
inf_cat = ['Pre', 'Post']
intclust_cat = ['1', '2', '3', '4ER+', '4ER-', '5', '6', '7', '8', '9', '10']
three_gene_cat = ['ER-/HER2-', 'HER2+', 'ER+/HER2- Low Prolif',
                  'ER+/HER2- High Prolif']

# Encode the variables with a predefined order
enc = OrdinalEncoder(categories=[cel_cat, her2_cat, inf_cat,
                                 intclust_cat, three_gene_cat]).fit(df_encode[modified_list])
encoder = enc.transform(df_encode[modified_list])
df_encode_new = pd.DataFrame(encoder, columns=modified_list)

# Encode the other variables
enc1 = OrdinalEncoder().fit(df_encode[keep_list])
encoder1 = enc1.transform(df_encode[keep_list])
df_encode_new1 = pd.DataFrame(encoder1, columns=keep_list)


Finally, the encoded columns were concatenated with the other numeric
columns to generate the final data frame.


# Merge encoded data and original data
df = pd.concat([df_encode_new, df_encode_new1,
                data[other_var].reset_index(drop=True)], axis=1)
print(df.shape)


To check the mapping between the encoded categories and the
original ones, the code below can be executed.


# To check the encoded categories
for i in range(len(modified_list)):
    print(modified_list[i], enc.categories_[i])
for i in range(len(keep_list)):
    print(keep_list[i], enc1.categories_[i])


CELLULARITY ['Low' 'Moderate' 'High']
HER2_SNP6 ['UNDEF' 'LOSS' 'NEUTRAL' 'GAIN']
INFERRED_MENOPAUSAL_STATE ['Pre' 'Post']
INTCLUST ['1' '2' '3' '4ER+' '4ER-' '5' '6' '7' '8' '9' '10']
THREEGENE ['ER-/HER2-' 'HER2+' 'ER+/HER2- Low Prolif' 'ER+/HER2- High Prolif']
CHEMOTHERAPY ['NO' 'YES']
ER_IHC ['Negative' 'Positve']
HORMONE_THERAPY ['NO' 'YES']
CLAUDIN_SUBTYPE ['Basal' 'Her2' 'LumA' 'LumB' 'NC' 'Normal' 'claudin-low']
LATERALITY ['Left' 'Right']
RADIO_THERAPY ['NO' 'YES']
HISTOLOGICAL_SUBTYPE ['Ductal/NST' 'Lobular' 'Medullary' 'Metaplastic' 'Mixed'
 'Mucinous'
 'Other' 'Tubular/ cribriform']
BREAST_SURGERY ['BREAST CONSERVING' 'MASTECTOMY']
CANCER_TYPE_DETAILED ['Breast' 'Breast Invasive Ductal Carcinoma'
 'Breast Invasive Lobular Carcinoma'
 'Breast Invasive Mixed Mucinous Carcinoma'
 'Breast Mixed Ductal and Lobular Carcinoma' 'Invasive Breast Carcinoma'
 'Metaplastic Breast Cancer']
ER_STATUS ['Negative' 'Positive']
HER2_STATUS ['Negative' 'Positive']
ONCOTREE_CODE ['BRCA' 'BREAST' 'IDC' 'ILC' 'IMMC' 'MBC' 'MDLC']
PR_STATUS ['Negative' 'Positive']


Fig. 9 Output of the mapping between the encoded and original categories. The figure shows the original
values for each encoded categorical column. The original categories are presented in ascending order based
on their corresponding encoded values
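The reason for passing explicit category lists to OrdinalEncoder for some variables can be seen on a toy column. By default, the encoder sorts categories alphabetically, which does not respect the natural order Low < Moderate < High. The values below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical CELLULARITY values for four patients
values = np.array([["High"], ["Low"], ["Moderate"], ["High"]], dtype=object)

# Default behaviour: categories are sorted alphabetically (High, Low, Moderate)
default_codes = OrdinalEncoder().fit_transform(values).ravel().tolist()

# Explicit order: Low -> 0, Moderate -> 1, High -> 2
ordered_enc = OrdinalEncoder(categories=[["Low", "Moderate", "High"]])
ordered_codes = ordered_enc.fit_transform(values).ravel().tolist()

print(default_codes)  # [0.0, 1.0, 2.0, 0.0]
print(ordered_codes)  # [2.0, 0.0, 1.0, 2.0]
```

With the explicit list, a larger code means higher cellularity, which keeps the encoded variable meaningful for models that treat it as an ordered quantity.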


Next, the clean data was saved in a CSV file, clinical.csv, in the
Data folder to be used for the following analysis and to be
integrated with transcriptomic data (Fig. 9).


# Save preprocessed data to CSV to merge with gene data
df.to_csv('Data/clinical.csv', index=False)


Correlation analysis was performed to understand the relation- 
ship between features in the data. The heat map in Fig. 10 presents 
the Pearson correlation matrix where the varying intensity of color 
represents the values of correlation. There were some highly corre- 
lated features observed in the data, such as ER_STATUS and 
ER_IHC, and AGE_AT_DIAGNOSIS and INFERRED_MENO- 
PAUSAL_STATE. 


# Drop Patient ID column as this is not relevant for the analysis
df = df.drop(['PATIENT_ID'], axis=1)

# Correlation analysis
colormap = plt.cm.Reds
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), linewidths=0.1, vmax=0.8,
            square=True, cmap=colormap, linecolor='white')
plt.title('Correlation matrix', fontsize=14)
plt.show()


Fig. 10 Correlation matrix of clinical features. The correlation matrix depicts the linear correlation between all
the pairs of attributes and ranges from −1 (perfect negative correlation) to +1 (perfect positive correlation),
with the value of zero being no correlation between the features. Color density represents the correlation
values, where the darker color implies higher values and the lighter color implies lower ones. The figure shows
some highly correlated features in the data, such as ER_STATUS and ER_IHC, and AGE_AT_DIAGNOSIS and
INFERRED_MENOPAUSAL_STATE
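Reading strongly correlated pairs off a heat map can be error-prone when there are many features. As a complement, the sketch below lists the pairs whose absolute Pearson correlation exceeds a chosen cutoff. The helper name high_correlation_pairs, the cutoff of 0.8, and the toy data are assumptions made for this example.

```python
import pandas as pd


def high_correlation_pairs(frame, threshold=0.8):
    """List (feature, feature, |r|) for pairs with |Pearson r| >= threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # upper triangle only
            value = float(corr.iloc[i, j])
            if value >= threshold:
                pairs.append((cols[i], cols[j], value))
    return sorted(pairs, key=lambda pair: -pair[2])


# Hypothetical toy data: b is a multiple of a, c is weakly related to a
toy = pd.DataFrame({"a": [1, 2, 3, 4],
                    "b": [2, 4, 6, 8],
                    "c": [1, -1, 1, -1]})
pairs = high_correlation_pairs(toy)
print(pairs)
```

Applied to the clinical data frame, such a list would be expected to include the pairs visible in Fig. 10, for example ER_STATUS and ER_IHC.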


Since the next steps of the pipeline are based on survival analysis,
we calculated the percentage of censored data using the lines of
code below. Overall, 42.2% of the records were censored.


num_censored = df.shape[0] - df["OS_STATUS"].sum()
print("%.1f%% of records are censored" % (num_censored / df.shape[0] * 100))


Then, the follow-up time distribution of deceased and censored
patients was plotted using the code below. The final chart is shown
in Fig. 11. This step allows a further investigation into the time-to-event
distribution for censored/non-censored patients.


# Time distribution of death and censoring
plt.figure(figsize=(9, 6))
val, bins, patches = plt.hist((df.query('OS_STATUS == 1')['OS_MONTHS'],
                               df.query('OS_STATUS == 0')['OS_MONTHS']),
                              bins=30, stacked=True)
_ = plt.legend(patches, ["Time of Deaths", "Time of Censored"])
plt.title("Time Distribution of Censored and Death Patients")


Fig. 11 Distribution of follow-up times of censored and observed (death) events. 42.2% of the total
observations were censored. The distribution is right-skewed and is different between censored patients
and those who experienced the event. The censored group has more patients with longer survival times

3.4.3 Plot Cox Proportional Hazards Model

In the next step of our pipeline (step 5 in Fig. 1), the CPH model
was fitted on the clinical data. The results were then visualized and
reported to view the coefficients and ranges of features. Before
running the analysis, data needed to be normalized. Min-max
normalization, one of the most popular methods to normalize
data, was applied. The method is based on the formula in Eq. 15,
and the transformed data values range between 0 and 1.

$$x_{scaled} = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{15}$$

# Cox survival analysis
# Normalise data
ss = MinMaxScaler()
df_norm = df.drop(['OS_STATUS', 'OS_MONTHS'], axis=1)
df_norm = pd.DataFrame(ss.fit_transform(df_norm), columns=df_norm.columns)

# Add the (unscaled) outcome columns back
df_norm['OS_STATUS'] = df['OS_STATUS']
df_norm['OS_MONTHS'] = df['OS_MONTHS']

The next step was to use the entire dataset to fit the Cox 
regression model, and the final results were plotted using the 
code below. 


# Build model
# Cox proportional hazards model
cph = CoxPHFitter()
cph.fit(df_norm, duration_col='OS_MONTHS', event_col='OS_STATUS')

# Plot
plt.figure(figsize=(9, 12))
plt.title('Cox Proportional Hazards Model for Clinical data')
cph.plot()

# Report
cph.print_summary(columns=["coef", "exp(coef)", "exp(coef) lower 95%",
                           "exp(coef) upper 95%", "z", "p"], decimals=3)


The log-hazard ratio of each feature and its statistical report are
presented in Figs. 12 and 13, respectively. According to Fig. 12,
AGE_AT_DIAGNOSIS was found to be the most significant factor
associated with the death events, with a coefficient (log hazard
ratio) of 3.753, which corresponds to a hazard ratio of 42.632
(exp(coef) in Fig. 13). Because the features were min-max normalized,
this ratio compares a patient at the maximum observed age with one at
the minimum observed age: the oldest patients had a hazard of death
about 42.6 times that of the youngest ones.
LYMPH_NODES_EXAMINED_POSITIVE was the second critical factor among
the clinical data, with a coefficient of 1.888 (hazard ratio of
6.607). Patients with the highest number of positive lymph nodes
tended to have a hazard of death about 6.6 times that of patients
with the lowest number. As shown in Fig. 13, the overall C-index of
this model is 0.685, which indicates an acceptable predictive model.
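The relation between the coef and exp(coef) columns of Fig. 13 can be checked with a few lines of arithmetic. The sketch below uses the rounded coefficients printed in the report, so the results agree with the exp(coef) column only up to rounding; the helper name hazard_ratio is hypothetical.

```python
import math


def hazard_ratio(coef):
    """Convert a Cox coefficient (log hazard ratio) into a hazard ratio."""
    return math.exp(coef)


# Coefficients as printed (rounded to three decimals) in Fig. 13
coefficients = {
    "AGE_AT_DIAGNOSIS": 3.753,
    "LYMPH_NODES_EXAMINED_POSITIVE": 1.888,
    "INFERRED_MENOPAUSAL_STATE": -0.481,
}

for feature, coef in coefficients.items():
    print(feature, round(hazard_ratio(coef), 2))
```

A positive coefficient gives a hazard ratio above 1 (higher risk as the feature increases), while a negative coefficient gives a hazard ratio below 1 (lower risk).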

Fig. 12 Results of the Cox proportional hazards model for clinical data. The log-hazard ratio is plotted for all
the features, with a 95% confidence interval (CI). AGE_AT_DIAGNOSIS, LYMPH_NODES_EXAMINED_POSITIVE,
and TUMOR_SIZE were found as the top three most significant factors associated with the death events with
the log(HR) values of 3.753, 1.888, and 1.189, respectively. In other words, patients having higher values of
these three predictors are more likely to have lower survival times. In contrast, the predictors with a log(HR)
value below zero, such as HISTOLOGICAL_SUBTYPE and INFERRED_MENOPAUSAL_STATE, were negatively
associated with the death event. Patients with higher values of these factors tend to live longer compared to those
who have low values

model                        lifelines.CoxPHFitter
duration col                 'OS_MONTHS'
event col                    'OS_STATUS'
baseline estimation          breslow
number of observations       1977
number of events observed    1143
partial log-likelihood       -7603.988
time fit was run             2021-12-23 12:16:40 UTC

                                 coef  exp(coef)  exp(coef) lower 95%  exp(coef) upper 95%        z        p
CELLULARITY                    -0.107      0.899                0.749                1.079   -1.147    0.251
HER2_SNP6                       0.298      1.347                0.853                2.129    1.277    0.201
INFERRED_MENOPAUSAL_STATE      -0.481      0.618                0.488                0.783   -3.999  <0.0005
INTCLUST                       -0.027      0.973                0.794                1.193   -0.264    0.792
THREEGENE                       0.428      1.535                1.189                1.981    3.286    0.001
CHEMOTHERAPY                    0.324      1.383                1.106                1.728    2.845    0.004
ER_IHC                          0.014      1.014                0.766                1.343    0.099    0.921
HORMONE_THERAPY                -0.061      0.940                0.810                1.091   -0.809    0.418
CLAUDIN_SUBTYPE                -0.029      0.971                0.771                1.224   -0.246    0.806
LATERALITY                     -0.106      0.899                0.799                1.012   -1.765    0.078
RADIO_THERAPY                  -0.179      0.836                0.716                0.975   -2.280    0.023
HISTOLOGICAL_SUBTYPE           -0.571      0.565                0.324                0.984   -2.015    0.044
BREAST_SURGERY                  0.087      1.091                0.935                1.272    1.103    0.270
CANCER_TYPE_DETAILED           -0.060      0.942                0.547                1.624   -0.215    0.830
ER_STATUS                      -0.355      0.701                0.527                0.933   -2.432    0.015
HER2_STATUS                     0.216      1.241                0.983                1.565    1.819    0.069
ONCOTREE_CODE                   0.641      1.898                1.099                3.276    2.300    0.021
PR_STATUS                      -0.071      0.932                0.809                1.073   -0.983    0.326
LYMPH_NODES_EXAMINED_POSITIVE   1.888      6.607                3.451               12.648    5.699  <0.0005
NPI                             0.597      1.817                1.162                2.841    2.618    0.009
AGE_AT_DIAGNOSIS                3.753     42.632               24.263               74.906   13.049  <0.0005
COHORT                          0.109      1.115                0.901                1.380    1.003    0.316
GRADE                           0.117      1.124                0.878                1.439    0.929    0.353
TUMOR_SIZE                      1.189      3.285                1.717                6.286    3.592  <0.0005
TUMOR_STAGE                     0.657      1.928                1.127                3.301    2.395    0.017
TMB_NONSYNONYMOUS               0.008      1.008                0.331                3.068    0.014    0.989

Concordance                  0.685
Partial AIC                  15259.976
log-likelihood ratio test    497.155 on 26 df
-log2(p) of ll-ratio test    291.895

Fig. 13 Cox proportional hazards report for clinical data. The report indicates that OS_MONTHS was the
duration variable, while OS_STATUS was the event variable used for survival analysis. The figure also reports

The advantage of fitting the entire dataset to a regression model
is that more data is fitted to the CPH model, which usually
increases the accuracy of the model. Besides, the predictive
capabilities of the fitted CPH model can be evaluated to see how well
the algorithm performs on the entire data. However, the generalization
of the model cannot be assessed if the entire data is fitted to the
model, and it is usually considered less trustworthy. Hence,
cross-validation can be performed to reduce selection bias and
overfitting. This approach also provides more insight into how well the
model will perform on unseen data. Therefore, the next step of the
analysis was to conduct a fivefold cross-validation to get an average
C-index and generate more robust prediction scores. Specifically, a
fivefold cross-validation approach splits the data into five folds, four
of which are used as a training set to fit the model. The fitted model
is then evaluated on the left-out fold and a C-index is computed.
The process is repeated for all the possible combinations of training
and testing sets using the five folds. The final C-index is calculated as
the average of the five C-index values generated during the fivefold
cross-validation process. The code below can be run to perform
cross-validation and generate the average C-index (Fig. 14).


# Cross validation (optional)
scores = k_fold_cross_validation(cph, df_norm, 'OS_MONTHS', event_col='OS_STATUS',
                                 k=5, scoring_method="concordance_index", seed=18)
print("Average score", round(np.mean(scores), 3))
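To make the scoring metric concrete, the sketch below computes a concordance index by brute force over all comparable pairs. It is a simplified illustration and not the routine used by lifelines or scikit-survival: a pair is treated as comparable only when the patient with the shorter time experienced the event, ties in predicted risk count as one half, and ties in survival time are ignored. The function name and the toy data are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable patient pairs that the risk scores order correctly.

    A higher risk score is expected to go with a shorter survival time.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Patient i must have had the event strictly before patient j's time
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data for illustration: the third patient is censored (event = 0)
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [4.0, 3.0, 2.0, 1.0]
print(concordance_index(times, events, risk))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is why values such as 0.685 are read as moderate discrimination.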


3.4.4 Set Up and Evaluate Machine Learning Algorithms

In order to run the ML algorithms (step 6 in Fig. 1), the following
steps were applied:


1. Data was split into training and testing sets using a stratified 
split with a ratio of 80:20. 


2. The machine learning models were trained using a fivefold 
cross-validation approach on the training set. Grid search was 
applied to autotune hyperparameters to get optimal solutions. 


3. The trained models were applied to the testing set to generate 
patient-specific predictions. 

4. Steps 1-3 were repeated 20 times on different splits of training
and testing sets to obtain an average C-index. This process
provides a more robust evaluation of the models since it does
not depend on a single training-testing split.


5. The prediction scores generated by the models were used to 
separate the patients in the testing set into higher risk and lower 
risk to investigate any significant difference in the survival rates 
of the two groups. 
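Steps 1 and 4 of the list above rely on a stratified split, i.e., a split that keeps the proportion of censored and observed patients the same in the training and testing sets. The following sketch shows the idea with the standard library only; it is not the scikit-learn train_test_split implementation, and the function name, the toy status vector, and the seeds are hypothetical.

```python
import random


def stratified_split(statuses, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) keeping the event/censoring ratio in both sets."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for value in sorted(set(statuses)):
        # Collect and shuffle the patients sharing the same status
        members = [i for i, s in enumerate(statuses) if s == value]
        rng.shuffle(members)
        n_test = round(len(members) * test_size)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy status vector: 60 observed events and 40 censored patients
statuses = [1] * 60 + [0] * 40

# Repeating the split with different seeds mimics step 4 of the list
for seed in (1, 2, 3):
    train, test = stratified_split(statuses, test_size=0.2, seed=seed)
    events_in_test = sum(statuses[i] for i in test)
    print(seed, len(train), len(test), events_in_test)
```

Each seed yields a different testing set, but the share of observed events in the testing set stays the same, so the C-index values of the repeated runs are comparable.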


Average score 0.677

Fig. 14 Average C-index of fivefold cross-validation for Cox proportional hazards
models. The figure shows the average C-index generated during the fivefold
cross-validation process. The final C-index was 0.677, which was lower than the
C-index of 0.685 reported in Fig. 13

Fig. 13 (continued) the HR values (exp(coef)), with the corresponding 95% confidence interval, and p-values of
the clinical features. The prediction accuracy of the CPH model, i.e., the C-index, was 0.685, which indicates
an acceptable model. Similar to the results presented in Fig. 12, AGE_AT_DIAGNOSIS,
LYMPH_NODES_EXAMINED_POSITIVE, and TUMOR_SIZE were identified as the top three most significant factors
associated with the death event with a p-value less than 0.0005, and coefficient/log(HR) values of 3.753, 1.888,
and 1.189, respectively

The five steps presented above are further discussed and
illustrated below. First, we set up a seed value to ensure the
reproducibility of the results. Then, the data was arranged into a data
frame X containing the prognostic attributes and a structured array y
containing the target variables (survival time and status). The new
data structures were split into training and testing sets using a random
and stratified approach with a ratio of 80:20.


# Set up seed and the options for the cross-validation approach
SEED = 5
CV = KFold(n_splits=5, shuffle=True, random_state=0)

# Split data to prepare for ML
X = df.drop(['OS_MONTHS', 'OS_STATUS'], axis=1)
df['OS_STATUS'] = np.where(df['OS_STATUS'] == 1, True, False)
y = df[['OS_STATUS', 'OS_MONTHS']].to_records(index=False)

# Split the data set into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y['OS_STATUS'], random_state=SEED)


Once the data was prepared, the ML models were applied by
defining a function that trains and evaluates each estimator. We used
grid search with fivefold cross-validation to train and tune the
hyperparameters for each estimator. Then, we applied the optimal
algorithm to generate the final predictions on the testing set. The
function prints the optimal model and its C-index on the testing set,
and returns the optimal model together with its predictions.


# Build model
# Define a function for grid search to tune training model
# and predict the results
def grid_search(estimator, param, X_train, y_train, X_test, y_test, CV):
    # Define Grid Search
    gcv = GridSearchCV(estimator, param_grid=param, cv=CV,
                       n_jobs=-1).fit(X_train, y_train)

    # Find best model
    model = gcv.best_estimator_
    print(model)

    # Generate predictions
    prediction = model.predict(X_test)
    result = concordance_index_censored(y_test["OS_STATUS"],
                                        y_test["OS_MONTHS"], prediction)
    print('C-index for test set (Hold out):', result[0])

    return [model, prediction]
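Grid search simply enumerates every combination of the listed hyperparameter values and evaluates each one with cross-validation. The sketch below only illustrates the enumeration step, using the GBS grid from Table 1; it is not the scikit-learn GridSearchCV implementation, and the helper name expand_grid is hypothetical.

```python
from itertools import product


def expand_grid(param_grid):
    """List every combination of hyperparameter values in a grid."""
    names = list(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# GBS grid as listed in Table 1
gbs_grid = {"learning_rate": [0.01, 0.1, 1],
            "n_estimators": [200, 500, 800, 1000]}

candidates = expand_grid(gbs_grid)
n_folds = 5
print(len(candidates), "candidate settings,",
      len(candidates) * n_folds, "model fits with fivefold cross-validation")
```

Because the number of fits grows multiplicatively with every added hyperparameter, the grids in Table 1 are kept small for the slower models such as RSF.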


Next, to avoid bias in our final evaluation, we ran each ML
model 20 times. With the function defined below, the number of
re-runs can be easily changed by modifying the value of n. The
function randomly generates n different seeds, one for each
iteration. In each loop, the data is split into training and testing
sets, the model is fitted on the training set, and the fitted model
is evaluated on unseen data. By doing so, we randomly
created n different testing sets and evaluated each algorithm
n times. Finally, the average results of the n runs were calculated
and reported.


# Re-run experiment 20 times
def c_index(model, X, y, n=20):
    np.random.seed(1)
    seeds = np.random.permutation(1000)[:n]

    # Train and evaluate the model n times
    cindex_score = []
    predict_list = []

    for s in seeds:
        X_trn, X_test, y_trn, y_test = train_test_split(
            X, y, test_size=0.2, stratify=y['OS_STATUS'], random_state=s)
        model.fit(X_trn, y_trn)
        prediction = model.predict(X_test)
        predict_list.append(prediction)
        result = concordance_index_censored(y_test["OS_STATUS"],
                                            y_test["OS_MONTHS"], prediction)
        cindex_score.append(round(result[0], 3))

    print('Average C-index for {} runs'.format(n), np.mean(cindex_score))

    return [cindex_score, predict_list]


After defining the two functions above for the ML process, we
designed the experiment pipeline by specifying the algorithms and
establishing their hyperparameters. Before applying the algorithms,
all the data had to be normalized using min-max normalization.
Different values of the ridge regression penalty were tested to tune the
CPH model (the values varied between 0.001 and 100, as shown in
Table 1).


# Define the Pipeline and hyperparameter
# CoxPHSurvivalAnalysis
pipe_cox = Pipeline([('scaler', MinMaxScaler()),
                     ('model', CoxPHSurvivalAnalysis())])
param_cox = {'scaler': [MinMaxScaler()],
             'model__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}


Table 1
Hyperparameters of the models. Each method was parametrized and trained using a fivefold
cross-validation approach. Grid search was used with different hyperparameters while maximizing the
C-index

Models  Hyperparameter name         Hyperparameter set              Selected value
CPH     Ridge regression parameter  [0.001, 0.01, 0.1, 1, 10, 100]  1
RSF     max_features                sqrt                            sqrt
        max_depth                   8                               8
        min_samples_leaf            [50, 100]                       50
        min_samples_split           100                             100
        n_estimators                500                             500
GBS     learning_rate               [0.01, 0.1, 1]                  0.1
        n_estimators                [200, 500, 800, 1000]           200
SSVM    optimizer                   [avltree, rbtree, simple]       avltree
        max_iter                    [500, 5000]                     500

Then, we set up the hyperparameters for the three ML-based
algorithms, namely RSF, GBS, and SSVM. In the RSF algorithm,
the n_estimators and the max_depth hyperparameters can be set to
specify the number of trees and the maximum depth of each tree in
the forest, while the min_samples_leaf and the min_samples_split
parameters can be set to specify the minimum number of samples
required to be at a leaf node and the minimum number of samples
required to split an internal node, respectively. The deeper the trees
grow in the forest, the more complex the model, which can easily
lead to overfitting and increased computational complexity. In
order to avoid these problems, a predefined max_depth parameter
can be set; otherwise, the trees are grown until each leaf contains
fewer than min_samples_split samples. The max_features
hyperparameter can also be defined to set the number of features to
consider when looking for the best split. By default, the algorithm
considers all the features and selects the one with the optimal metric to
perform the split. If the max_features parameter is set equal to
sqrt, the maximum number of features considered at each split is
equal to the square root of the total number of features in the
dataset. Reducing the number of features can save computational
resources, increase the stability of the forest, and reduce variance
and overfitting.

# Random Survival Forests
pipe_rsf = Pipeline([('scaler', MinMaxScaler()),
                     ('model', RandomSurvivalForest())])
param_rsf = {'scaler': [MinMaxScaler()],
             'model__random_state': [SEED],
             'model__max_features': ['sqrt'],
             'model__max_depth': [8],
             'model__min_samples_leaf': [50, 100],
             'model__min_samples_split': [100],
             'model__n_estimators': [500]}


In the GBS algorithm, the n_estimators parameter can be used
to set the number of trees to generate, while the learning_rate
parameter can be set to regulate the learning rate that shrinks the
contribution of each tree. The GBS model is fairly robust to overfitting,
so a higher value of the n_estimators parameter often results in
better performance. However, there is a trade-off between
n_estimators and learning_rate. Thus, different combinations of the
values listed for these two hyperparameters were tried in the tuning
phase.


# Gradient Boost Survival
pipe_gbs = Pipeline([('scaler', MinMaxScaler()),
                     ('model', GradientBoostingSurvivalAnalysis())])
param_gbs = {'scaler': [MinMaxScaler()],
             'model__random_state': [SEED],
             'model__learning_rate': [0.01, 0.1, 1],
             'model__n_estimators': [200, 500, 800, 1000]}


The hyperparameters defined for the SSVM algorithm included
the optimizer, which refers to the optimization technique, such as
the AVL tree (avltree), the red-black tree (rbtree), and the simple
methods. The max_iter parameter can be set to define the maximum
number of iterations to perform in the Newton optimization.
These hyperparameters are necessary to design an effective and
efficient SSVM model. A summary of the hyperparameters tuned
for each model using grid search and their final values is presented
in Table 1.

# Survival SVM
pipe_svm = Pipeline([('scaler', MinMaxScaler()),
                     ('model', FastSurvivalSVM())])
param_svm = {'scaler': [MinMaxScaler()],
             'model__random_state': [SEED],
             'model__max_iter': [500, 5000],
             'model__optimizer': ['avltree', 'rbtree', 'simple']}

Once data preparation and model design were completed, a
dictionary of estimators, pairing the name of each algorithm with
its pipeline and hyperparameter grid, was generated to support
looping the same procedure over each model.


# Estimator list:
estimator_list = {'Cox Regression': [pipe_cox, param_cox],
                  'Random Forest Survival': [pipe_rsf, param_rsf],
                  'Gradient Boosting Survival': [pipe_gbs, param_gbs],
                  'SVM Survival': [pipe_svm, param_svm]}


Since the training and testing phases for each algorithm follow
the same approach, we put them into a dictionary of estimators and
iterated the same procedure over it. The output of the procedure
displays the optimal algorithm, the holdout-test results, and the
average C-index for each model, as shown in Fig. 15. The results
show that the average C-index over 20 runs of the three ML-based
models outperformed CPH, a well-known statistical approach for
survival analysis. SSVM had the highest average C-index with a
value of 0.688, followed by RSF, GBS, and CPH with C-indices
of 0.685, 0.683, and 0.678, respectively.


model_list = []
pred_list = []
c_index_list = []
pred_list_n = []

for model_name, index in estimator_list.items():
    print('\n', model_name)
    estimator = index[0]
    param = index[1]
    outcome = grid_search(estimator, param, X_train, y_train,
                          X_test, y_test, CV)
    model = outcome[0]
    # Assumption: the printed listing never fills model_list, although the
    # SHAP step reads model_list[i][1]; store [name, tuned pipeline] pairs
    model_list.append([model_name, model])
    pred_list.append(outcome[1])

    # Run model n times to check C-index
    score, pre = c_index(model, X, y, n=20)
    c_index_list.append(score)
    pred_list_n.append(pre)


Cox Regression
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model', CoxPHSurvivalAnalysis(alpha=0.1))])
C-index for test set (Hold out): 0.660715999616086
Average C-index for 20 runs 0.6778500000000001

Random Forest Survival
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model',
                 RandomSurvivalForest(max_depth=8, max_features='sqrt',
                                      min_samples_leaf=50,
                                      min_samples_split=100, n_estimators=500,
                                      random_state=5))])
C-index for test set (Hold out): 0.6678184086764565
Average C-index for 20 runs 0.6854499999999998

Gradient Boosting Survival
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model',
                 GradientBoostingSurvivalAnalysis(n_estimators=200,
                                                  random_state=5))])
C-index for test set (Hold out): 0.667875995776946
Average C-index for 20 runs 0.6825

SVM Survival
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model',
                 FastSurvivalSVM(max_iter=500, optimizer='avltree',
                                 random_state=5))])
C-index for test set (Hold out): 0.6747288607351953
Average C-index for 20 runs 0.68755

Fig. 15 CPH and ML models' results for clinical data. The selected hyperparameters, initial test result, and the
average C-index of each model are displayed in the outcome. Overall, the average performance over 20 runs
of the three ML-based models outperformed the CPH model, a well-known statistical approach for survival
analysis. SSVM had the highest average C-index with a value of 0.688, followed by RSF, GBS, and CPH

Boxplots were then used to visualize and compare the distributions
of C-index values for the 20 runs for each model (Fig. 16). On
average, SSVM had the highest performance, followed by RSF,
GBS, and CPH.

# Visualise results
name = ['CPH', 'RSF', 'GBS', 'SSVM']
cv_res = []

for i in range(0, 4):
    for c in c_index_list[i]:
        cv_res.append([name[i], c])

c_plot = pd.DataFrame(cv_res, columns=['Model Name', 'C-index'])
ax = sns.boxplot(x="Model Name", y="C-index", data=c_plot)
plt.title('C-index for 20 runs')

Fig. 16 C-index comparisons for Experiment 1. Boxplots of C-index results of clinical data using CPH, RSF,
GBS, and SSVM. The experiments were replicated 20 times. In each experiment, the data was randomly
divided into training and testing sets with a ratio of 80:20 while guaranteeing the same censoring percentage
on each subset of data. SSVM was found to have the highest median C-index, followed by RSF, GBS, and CPH


The patients in the testing set were then ranked by their pre- 
dicted risk score and split into two equal-sized groups using the 
median risk score. High-risk groups included patients with prog- 
nostic risk scores greater than or equal to the median value, while 
low-risk groups included those with prognostic risk scores below 
the median value. 
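The median split described above can be written in a few lines. This is a minimal sketch with hypothetical risk scores; the function name split_by_median is not part of the original pipeline.

```python
import statistics


def split_by_median(risk_scores):
    """Flag each patient as high-risk (True) when the score is >= the median."""
    med = statistics.median(risk_scores)
    return [score >= med for score in risk_scores]


# Hypothetical predicted risk scores for four patients
scores = [0.10, 0.40, 0.35, 0.80]
high_risk = split_by_median(scores)
print(high_risk)
```

With an even number of patients and no tied scores, the two groups have the same size, which matches the two groups of 198 patients reported for the testing set in Fig. 17.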

In the next step of the pipeline (step 6 in Fig. 1), Kaplan-Meier
plots and log-rank tests were produced for all the models to
statistically investigate the differences between the survival curves of the
two groups. Figure 17 reveals that the lower-risk patients, i.e., those
with lower predicted risk scores, were associated with better
survival outcomes (i.e., higher survival probability). Besides, there
were statistically significant differences in the survival distributions
of high-risk and low-risk patients for all four models (p-values <
0.0001). The log-rank test was used to assess the statistical significance
and compute the p-value. This analysis shows that the clinical
factors can be used to split the patients into risk groups based on
their predicted scores. GBS gave the strongest separation between the
two risk groups, with a log-rank p-value of 5.918E-12.

Fig. 17 Kaplan-Meier curves to compare the high-risk and low-risk breast cancer groups, stratified by the
predicted survival risk scores generated by the four models. The high-risk group (n = 198) included patients
with predicted risk scores greater than or equal to the median value, while the low-risk group (n = 198)
comprised those below the median value. Also, the p-value from the log-rank test was calculated to determine
the statistical significance of the difference in survival functions between the two groups. The figure shows
statistically significant differences in survival distributions between the two groups for all four models with a
p-value lower than 0.0001


fig, ax = plt.subplots(2, 2, figsize=(12, 12))
k = 0

for pred in pred_list:
    df1 = X_test.reset_index(drop=True)
    risk = []
    y_pred = pred
    med = np.median(y_pred)
    # High risk (1) when the predicted score is at or above the median
    r = np.where(y_pred >= med, 1, 0)
    df1['Risk'] = r
    print(df1.shape)
    ix = df1['Risk'] == 1

    df_y = pd.DataFrame(y_test)
    df_y['OS_STATUS'] = np.where(df_y['OS_STATUS'] == True, 1, 0)
    df1['OS_STATUS'] = df_y['OS_STATUS']
    df1['OS_MONTHS'] = df_y['OS_MONTHS']
    T_hr, E_hr = df1.loc[ix]['OS_MONTHS'], df1.loc[ix]['OS_STATUS']
    T_lr, E_lr = df1.loc[~ix]['OS_MONTHS'], df1.loc[~ix]['OS_STATUS']

    # Set-up plots
    k += 1
    plt.subplot(2, 2, k)

    # Fit survival curves
    kmf_hr = KaplanMeierFitter()
    ax = kmf_hr.fit(T_hr, E_hr, label='HR').plot_survival_function()
    kmf_lr = KaplanMeierFitter()
    ax = kmf_lr.fit(T_lr, E_lr, label='LR').plot_survival_function()
    # The printed listing passes kmf_lr twice; both fitters are assumed here
    add_at_risk_counts(kmf_hr, kmf_lr)

    # Format graph
    plt.ylim(0, 1)
    ax.set_xlabel('Timeline (months)', fontsize='large')
    ax.set_ylabel('Percentage of Population Alive', fontsize='large')

    # Calculate p-value
    res = logrank_test(T_hr, T_lr, event_observed_A=E_hr,
                       event_observed_B=E_lr, alpha=.95)
    print('\nModel', name[k-1])
    res.print_summary()

    # Locate the label at the 1st out of 9 tick marks
    xloc = max(np.max(T_hr), np.max(T_lr)) / 10
    ax.text(xloc, .2, res.p_value, fontsize=15)
    ax.set_title('KM Curves {}'.format(name[k-1]))

plt.tight_layout()
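For readers who want to see what the Kaplan-Meier fitter and the log-rank test compute, the sketch below re-implements both with the standard library. It is a simplified illustration under stated assumptions, not the lifelines implementation: the survival curve is the product-limit estimate evaluated at each distinct event time, and the p-value uses the unweighted two-group log-rank statistic with a chi-square distribution with one degree of freedom. The function names and the toy data are hypothetical.

```python
import math


def kaplan_meier(times, events):
    """Product-limit estimate: list of (event time, survival probability)."""
    survival = 1.0
    curve = []
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if e and ti == t)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


def logrank_p_value(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns the p-value (chi-square, 1 df)."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)   # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)   # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if e and x == t)
        d_b = sum(1 for x, e in zip(times_b, events_b) if e and x == t)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with one degree of freedom
    return math.erfc(math.sqrt(chi2 / 2.0))


# Toy groups for illustration: all patients experienced the event
high_risk_times = [1, 2, 3, 4, 5]
low_risk_times = [10, 11, 12, 13, 14]
all_events = [1, 1, 1, 1, 1]
print(kaplan_meier(high_risk_times, all_events))
print(logrank_p_value(high_risk_times, all_events, low_risk_times, all_events))
```

A small p-value indicates that the two survival curves differ more than would be expected by chance, which is the criterion used to compare the high-risk and low-risk groups in Fig. 17.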


3.4.5 Interpret Model

Clinicians can rely on a predictive model when its outcome can be
interpreted. This is especially crucial for the healthcare domain,
where every decision relates to human life. Interpretability of ML
can be defined as the extent to which an individual can understand
the cause of the predicted outcome [72]. SHapley Additive
exPlanations (SHAP) values [47] can be applied to interpret the results
of the ML models run in the previous section. SHAP values
represent a unified approach to interpret predicted outcomes made by
complex ML algorithms. This explainable approach has gained
much attention from researchers, and it has been increasingly
applied in many fields, including medical and oncology applications
[Se #2 | 

As shown in step 7 in Fig. 1, SHAP values can be used to
measure the importance of the features by calculating the impact of
each feature on the model prediction. In other words, the method mainly
focuses on explaining the importance or the weight of a specific
feature on the model prediction. For each feature, each patient is
represented by one data point with a positive or negative value
indicating the direction of the impact. The higher the SHAP value
associated with the patient, the higher the mortality risk. For example,
for the age feature, a 20-year-old patient might have a negative SHAP
value of -1.5, meaning this young patient has a better prognosis and
would be expected to live longer. In contrast, a 70-year-old patient
might have a positive SHAP value of 1.0, indicating that this patient
faces a higher mortality risk. Hence, age, in this case, is an important
feature significantly influencing the survival rate of the patient.
The code below can be run to perform the SHAP interpretability
analysis. The run time for CPH, GBS, and SSVM is about 30 min
per model, while RSF requires about 16 h to generate the SHAP
plot.


# Initialize JS For Plot
shap.initjs()

for i in range(0, 4):
    print('\nModel', name[i])
    m = model_list[i][1]
    m.fit(X_train, y_train)
    explainer = shap.Explainer(m.predict, X_train,
                               feature_names=X_train.columns)
    shaps = explainer(X_test)
    shap.summary_plot(shaps, X_test)
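The idea behind a Shapley value can be illustrated on a tiny model. The sketch below computes exact Shapley values by averaging the marginal contribution of each feature over all feature orderings, replacing absent features with a single baseline patient. This is a simplification and not the algorithm used by the shap package, which averages over a background dataset and relies on approximations; the toy risk model, its weights, and the patient values are hypothetical.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline)."""
    n = len(x)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            # Switch feature i from its baseline value to the patient's value
            current[i] = x[i]
            value = predict(current)
            contributions[i] += value - previous
            previous = value
    return [c / len(orderings) for c in contributions]


def toy_risk_model(features):
    """Hypothetical linear risk score: 0.05 * age + 0.3 * positive lymph nodes."""
    age, nodes = features
    return 0.05 * age + 0.3 * nodes


patient = [70, 2]    # hypothetical patient: 70 years old, 2 positive nodes
reference = [50, 1]  # hypothetical baseline patient
print(shapley_values(toy_risk_model, patient, reference))
```

The values of all features sum to the difference between the prediction for the patient and the prediction for the baseline, so each value can be read as the share of the risk score attributed to that feature.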


Figure 18 shows the SHAP summary plot for clinical data,
where a single patient is represented by one data point for each
feature. The x-axis represents the effect of the features on the
prediction of the algorithm for a specific observation in the testing
set, while the y-axis reports the top prognostic predictors in
descending order based on their importance ranking.
AGE_AT_DIAGNOSIS was found consistently among the four models as
the top significant factor impacting the survival risk. Specifically,
higher age is associated with higher mortality risk. The SHAP
values and feature ranking are slightly different across the models.
For example, according to CPH and SSVM,
INFERRED_MENOPAUSAL_STATE was the second most important feature
associated with survival risk, while NPI was the second most
important feature in RSF and GBS. In contrast, LATERALITY
and ER_STATUS had the lowest impact on the outcome of the
model, as their SHAP values were clustered close to zero. Hence, we
can get a holistic picture of the model prediction from the SHAP plots as
they illustrate the importance of features and their corresponding
impact on the outcome while showing the value distribution of
those features in the test set.

Fig. 18 SHAP summary plot for clinical data for (a) CPH, (b) RSF, (c) GBS, and (d) SSVM models. For each
clinical feature, a single patient is represented by one point. The y-axis lists the top prognostic features and
presents them in descending order based on their importance ranking provided by the mean of their absolute
SHAP values. The x-axis reports the SHAP value indicating the impact of the feature on the prediction of the
algorithm for a specific observation in the testing set. The color represents the value of the feature. The higher
the SHAP value the patient had, the higher the risk of death or the shorter survival time. AGE_AT_DIAGNOSIS
was found consistently among the four models as the top significant factor impacting the survival risk

3.5 Experiment 2: Transcriptomic Data


The second experiment presented in our work consists of conducting
survival analysis on transcriptomic data following steps similar to
those presented in Subheading 3.4. Since transcriptomic data is
high-dimensional data, feature extraction is highly recommended to
avoid overfitting and save computational time and resources. Fifty
features were extracted from the transcriptomic data and used to train
and evaluate the models. To quickly reproduce the experiment and
for a more straightforward presentation, we divided this experiment
into two notebooks. The first one includes the preprocessing
and feature selection steps, where the number of extracted features
can be easily changed. The training and evaluation of the models
are performed in the second notebook, where the data extracted
from the first notebook are explored and used for the ML models.

3.5.1 Load Data

First, transcriptomic data was loaded. As it omitted OS_MONTHS
and OS_STATUS, clinical data was also required to be loaded into
the data frame to extract the relevant information about survival
time and status.

# Load data
file1 = pd.read_csv('Data/data_clinical_patient.csv')
file2 = pd.read_csv('Data/data_mRNA_median_all_sample_Zscores.csv')


Then, the first five rows and data information are displayed, as
shown in Fig. 19.

# Have a quick look on data
file2.info()
file2.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24368 entries, 0 to 24367
Columns: 1906 entries, Hugo_Symbol to MB-4313
dtypes: float64(1905), object(1)
memory usage: 354.4+ MB

   Hugo_Symbol  Entrez_Gene_Id  MB-0362  MB-0346  MB-0386  MB-0574  MB-0503
0  RERE                  473.0  -0.7082   1.2179   0.0168  -0.4248   0.4916
1  RNF165             494470.0  -0.4419   0.4140  -0.6843  -1.1139  -0.6875
2  CD049690                NaN   0.2236   0.2255   0.5691   0.3545   0.7865
3  BC033982                NaN  -2.1485   0.4763  -0.2446   0.2618  -0.2695
4  PHF7                51533.0  -0.3220  -1.0921   0.2830  -0.2864   0.0772

Fig. 19 Output of transcriptomic data information. The figure presents an overview of the transcriptomic data
frame, including the total number of entries and the number of columns. There are 24,368 genomic entries
and 1906 columns in the data. As the dataset contains too many columns, the output shows only the
first 7 columns, while the first five rows are extracted

The transcriptomic data included 1906 columns and 24,368
rows. The first two columns report the gene identifiers in different
formats, namely Hugo_Symbol and Entrez_Gene_Id, while the rest
of the columns report the data for the 1904 patients.


3.5.2 Preprocess Data

Hugo Symbols (Hugo_Symbol column) were used in the transcriptomic
data as a stable identifier for genes. Therefore, the
Entrez_Gene_Id column was removed from the data frame.

# Drop unused column
file2 = file2.drop('Entrez_Gene_Id', axis=1)


We kept only the rows with non-blank values in the Hugo_Symbol column.
Next, missing and duplicate values were checked and removed
(step 3 in Fig. 1). A different approach for dealing with duplicates
in transcriptomic data consists of replacing all the duplicates for
each gene with their average value. For the sake of simplicity, in this
tutorial, we decided to simply remove the duplicate values.


# Drop NA in GeneID
file2 = file2[file2['Hugo_Symbol'].notna()]

# Check null in GeneID columns
file2['Hugo_Symbol'].isnull().sum()

# Check duplicate values
print('The number of duplicate values of Hugo_Symbol in data:',
      file2['Hugo_Symbol'].duplicated().sum())

# Drop duplicate values for Gene ID
file2 = file2.drop_duplicates(subset=['Hugo_Symbol'])
print('After pre-processing, the number of duplicate values of Hugo_Symbol:',
      file2['Hugo_Symbol'].duplicated().sum())
print('Shape of Gene data:', file2.shape)
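The alternative mentioned above, replacing duplicated genes with their average instead of dropping them, can be sketched as follows. This is an illustration with the standard library only and was not used in the experiment; the function name, the gene symbols, and the expression values are hypothetical.

```python
from collections import defaultdict


def average_duplicates(rows):
    """Average the expression values of rows that share a gene symbol.

    rows is a list of (gene_symbol, [expression value per patient]) pairs.
    """
    grouped = defaultdict(list)
    for symbol, values in rows:
        grouped[symbol].append(values)
    return {symbol: [sum(column) / len(column) for column in zip(*copies)]
            for symbol, copies in grouped.items()}


# Hypothetical expression rows: GENE_A appears twice
rows = [("GENE_A", [1.0, 2.0]),
        ("GENE_B", [0.5, 0.5]),
        ("GENE_A", [3.0, 4.0])]
print(average_duplicates(rows))
```

In pandas, a group-by on the Hugo_Symbol column followed by a mean would play the same role; dropping duplicates instead keeps the first occurrence of each symbol and discards the rest.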


Figure 20 shows that initially there were 192 repeated Hugo_Symbol
values in our data. After preprocessing, there were no duplicates,
and the final data frame had 24,176 rows (i.e., Hugo symbols) and
1905 columns (i.e., 1904 patients and one Hugo symbol ID). After
eliminating duplicate values, the data frame was transposed to allow
the matching of the patient IDs in the transcriptomic data with those
in the clinical data. As shown in Fig. 20, the new shape of the data
frame was 1904 rows (i.e., patients) and 24,177 columns (i.e., 24,176
Hugo symbols and one PATIENT_ID column). The first three rows of the
new data frame were displayed to check the format of the transposed
matrix.




The number of duplicate values of Hugo_Symbol in data: 192
After pre-processing, the number of duplicate values of Hugo_Symbol in data: 0
Shape of Gene data: (24176, 1905)
New shape of Gene data: (1904, 24177)

  PATIENT_ID     RERE   RNF165  CD049690  BC033982     PHF7    CIDEA    PAPD4  AI082173  SLC17A3
0    MB-0362  -0.7082  -0.4419    0.2236   -2.1485  -0.3220   0.0543  -0.7462   -0.4045   0.7777
1    MB-0346  -1.2179   0.4140    0.2255    0.4763  -1.0921  -1.1534   0.0709    0.5118  -0.5187
2    MB-0386   0.0168  -0.6843    0.5691   -0.2446   0.2830   2.9594  -0.6240   -0.3849   0.6866

3 rows x 24177 columns


Fig. 20 Output of transcriptomic data preprocessing. The dataset includes 192 duplicate Hugo symbols. After
removing duplicate values in the Hugo_Symbol column, there are 24,176 rows (i.e., Hugo symbols) and 1905
columns (i.e., 1904 patients and one Hugo symbol column) in the dataset. We transposed the data frame
before merging it with the clinical data to retrieve the OS_MONTHS and OS_STATUS columns, needed for
performing survival analysis. After transposing the data, the new table contained 1904 patient rows and
24,177 columns (24,176 Hugo symbols and one patient ID column). The first three rows of the table are
shown in the figure


# Transpose so that patients become rows, matching the clinical data
file2 = (file2.set_index('Hugo_Symbol').T
         .rename_axis('PATIENT_ID')
         .rename_axis(None, axis=1)
         .reset_index())
print('New shape of Gene data:', file2.shape)
file2.head(3)


The new data frame was then merged with the OS_MONTHS
and OS_STATUS columns from the clinical data based on the patient
ID information. The resulting data frame comprised only the
patients present in both the transcriptomic and clinical tables.


# Merge gene data with OS time and status
data = pd.merge(file1[['PATIENT_ID', 'OS_MONTHS', 'OS_STATUS']], file2,
                how="inner", on=["PATIENT_ID"])

# Have a quick look at data
data.head()
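Before relying on an inner merge, it can be useful to see how many patients appear in only one of the two tables. The following sketch shows one possible check using the indicator option of pandas.merge; the two miniature tables and their values are hypothetical stand-ins for the clinical and transcriptomic data frames.

```python
import pandas as pd

# Hypothetical miniature versions of the clinical and transcriptomic tables
clinical = pd.DataFrame({"PATIENT_ID": ["MB-0001", "MB-0002", "MB-0003"],
                         "OS_MONTHS": [10.0, 20.0, 30.0]})
genes = pd.DataFrame({"PATIENT_ID": ["MB-0002", "MB-0003", "MB-0004"],
                      "GENE_A": [0.1, 0.2, 0.3]})

# An outer merge with indicator=True labels each row as
# 'both', 'left_only' (clinical only), or 'right_only' (genes only)
check = pd.merge(clinical, genes, how="outer", on="PATIENT_ID", indicator=True)
match_counts = check["_merge"].value_counts()
print(match_counts)

# An inner merge, as used in this tutorial, keeps only the 'both' rows
matched = pd.merge(clinical, genes, how="inner", on="PATIENT_ID")
```

Patients counted as left_only or right_only are the ones silently dropped by the inner merge.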


In the next step (step 3 in Fig. 1), we checked if the new data 
frame contained any missing values. 


# Check missing values
print('Total missing value in the dataset:', data.isnull().sum().sum())

cols_missvalue = data.columns[data.isnull().sum() > 0]
print('List columns having missing data:', cols_missvalue)


According to the output presented in Fig. 21, there were
10 missing values in the entire data. We replaced those with the
average values of the corresponding columns. Several techniques
have been proposed to handle missing values in transcriptomic
data, such as k-nearest neighbors imputation, Gaussian mixture
clustering imputation, and weighted least squares imputation
[74-76]. For simplicity, we used mean imputation in this tutorial.


Total missing values in the dataset: 10
List columns having missing data: Index(['TMPRSS7', 'SLC25A19', 'IDO1',
       'CSNK2A1', 'BAMBI', 'MRPL24', 'AK127905',
       'FAM71A'],
      dtype='object')
After preprocessing, the number of missing values: 0


Fig. 21 Output of the processing of missing values. There are 10 missing values in the entire data. The Hugo
symbol columns with missing values are TMPRSS7, SLC25A19, IDO1, CSNK2A1, BAMBI, MRPL24, AK127905, and
FAM71A. As the missing values in the data are numeric, we replace them with the average values of the
corresponding columns


# Deal with missing values
# Replace missing values with the mean of the corresponding column
data[cols_missvalue] = data[cols_missvalue].fillna(data[cols_missvalue].mean())

# Check missing values again
print('After preprocessing, the number of missing values:',
      data.isna().sum().sum())
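As a point of comparison with the mean imputation used here, the k-nearest neighbors imputation mentioned above could be sketched with scikit-learn's KNNImputer. This is an illustrative alternative rather than part of the tutorial pipeline; the toy table, the gene names, and the choice of n_neighbors=2 are assumptions, and neighbors are taken among patients (rows).

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical expression table: rows are patients, columns are genes
toy = pd.DataFrame({
    "GENE_A": [0.1, 0.2, np.nan, 0.4],
    "GENE_B": [1.0, 1.1, 1.2, 1.3],
    "GENE_C": [5.0, np.nan, 5.2, 5.3],
})

# Each missing entry is filled with the average of that gene in the
# two most similar patients (similarity is computed on observed genes)
imputer = KNNImputer(n_neighbors=2)
toy_imputed = pd.DataFrame(imputer.fit_transform(toy),
                           columns=toy.columns, index=toy.index)
print(toy_imputed)
```

With only 10 missing values in more than 24,000 columns, the choice of imputation method is unlikely to matter much in this dataset, which is consistent with the simpler option adopted in the tutorial.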


3.5.3 Feature Selection

mRMR was applied to select the most relevant gene features (Hugo
symbols) to be used for the ML models (step 4 in Fig. 1). Before
employing mRMR, it is recommended to normalize the data to boost the
performance of the algorithm and save computational time. Hence,
after removing the survival and patient ID information, min-max
normalization was applied to the transcriptomic data.


# Normalise data
ss = MinMaxScaler()

X_norm = data.drop(['OS_STATUS', 'OS_MONTHS', 'PATIENT_ID'], axis=1)
X_norm = pd.DataFrame(ss.fit_transform(X_norm), columns=X_norm.columns)
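For reference, min-max normalization rescales each column to the [0, 1] range via x' = (x - min) / (max - min), which is what MinMaxScaler computes with its default feature range. The sketch below reproduces the formula with plain pandas on a made-up two-gene table; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical expression values for two genes in three patients
toy = pd.DataFrame({"GENE_A": [-2.0, 0.0, 2.0],
                    "GENE_B": [1.0, 3.0, 2.0]})

# Column-wise min-max normalization: x' = (x - min) / (max - min)
toy_norm = (toy - toy.min()) / (toy.max() - toy.min())
print(toy_norm)
```

Note that this plain formula divides by zero for a constant column, a case that a production scaler needs to handle separately.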


For the mRMR algorithm, the number of selected features can
be easily changed by modifying the value of K in the code below. In
this experiment, we extracted 50 features (K = 50) to demonstrate
how to run the pipeline. The more features are selected, the longer
the mRMR algorithm takes to run. For 50 features, the selection
took around 30-45 min. After the features were extracted, the new
data frame was saved to a new CSV file (GeneID_MRMR_50.csv); this
file is required for the ML process in the second notebook.


# Feature extraction
# Select features using mRMR
y_mrmr = data['OS_MONTHS']

features_selected = mrmr_classif(X_norm, y_mrmr, K=50)
X_mrmr = data[features_selected]

# Save to csv file
df_mrmr = X_mrmr.copy()  # copy to avoid modifying a slice of `data`
df_mrmr['PATIENT_ID'] = data['PATIENT_ID']

df_mrmr.to_csv('Data/GeneID_MRMR_50.csv', index=False)
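To give an intuition for what the selection step does, the following is a simplified greedy sketch of the minimum-redundancy maximum-relevance idea. It is not the implementation behind the mrmr_classif call above: here, as assumptions made for illustration, relevance is the absolute Pearson correlation with the target, redundancy is the mean absolute correlation with the features already selected, and the two are combined by subtraction. The function greedy_mrmr and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def greedy_mrmr(X, y, k):
    """Pick k columns of X, trading relevance to y against redundancy."""
    relevance = X.corrwith(y).abs()   # relevance of each feature to the target
    corr = X.corr().abs()             # pairwise feature redundancy
    selected = [relevance.idxmax()]   # start from the most relevant feature
    while len(selected) < k:
        candidates = [c for c in X.columns if c not in selected]
        redundancy = corr.loc[candidates, selected].mean(axis=1)
        score = relevance[candidates] - redundancy
        selected.append(score.idxmax())
    return selected

# Toy data: 'a_copy' nearly duplicates 'a', and the target depends on a and b
rng = np.random.default_rng(0)
a, b = rng.normal(size=500), rng.normal(size=500)
X_toy = pd.DataFrame({
    "a": a,
    "a_copy": a + 0.01 * rng.normal(size=500),
    "b": b,
    "noise": rng.normal(size=500),
})
y_toy = pd.Series(a + b)

selected = greedy_mrmr(X_toy, y_toy, k=2)
print(selected)
```

In this toy example, the redundancy penalty prevents the near-duplicate pair a and a_copy from being selected together, which is the behavior that distinguishes mRMR from ranking features by relevance alone.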


For easier processing, a new Jupyter notebook was created, and
the extracted data was loaded to carry out the next steps of the
analysis. Then, the transcriptomic data of the 50 extracted features
was merged with the clinical data by PATIENT_ID. After merging, the
PATIENT_ID column was no longer relevant for the ML analysis, and it
was removed from the data frame. Before analyzing the data, the
OS_STATUS column was encoded to numeric values. The final data
frame included 1904 rows (patients) and 52 columns (the survival
time and status of patients, and 50 genes), as shown in Fig. 22.


# Load data
file1 = pd.read_csv('Data/data_clinical_patient.csv')
file2 = pd.read_csv('Data/GeneID_MRMR_50.csv')

# Merge gene data with OS time and status
data = pd.merge(file1[['PATIENT_ID', 'OS_MONTHS', 'OS_STATUS']], file2,
                how="inner", on=["PATIENT_ID"])

# Preprocess data
# Drop unused columns
drop_list = ['PATIENT_ID']
df = data.drop(drop_list, axis=1)
print('After the first preprocessing, the shape of data is', df.shape)

# Encode OS status as a binary event indicator (1 = deceased, 0 = censored)
df['OS_STATUS'] = np.where(df['OS_STATUS'] == '1:DECEASED', 1, 0)


After cleaning, the shape of data is (1904, 52) 
Missing value number: 0 


Fig. 22 Output of preprocessing step. The figure shows that there were 1904 
rows and 52 columns in the final data frame. The columns in the final dataset 
comprised OS_MONTHS, OS_STATUS, and 50 transcriptomic extracted feature 
columns. No missing values were found in the data 
Fig. 23 Correlation matrix of 50 gene expression features. The correlation matrix depicts the linear correlation
between all the pairs of attributes and ranges from -1 (perfect negative correlation) to +1 (perfect positive
correlation), with the value of zero representing no correlation between the features. Color density represents
the values of the correlation, where a darker color implies higher values and a lighter color implies lower
ones. No highly correlated features were observed in the data


Once the preprocessing steps (step 3 in Fig. 1) were completed,
we conducted a correlation analysis and plotted the follow-up
survival-time distribution to investigate the data, as displayed in
Figs. 23 and 24. No highly correlated features were found in the
selected transcriptomic data.
Fig. 24 Distribution of follow-up times of censored and observed (death) events associated with the
transcriptomic data selected by mRMR. In the data, 42.1% of the observations were censored. The
distribution is right-skewed, and it differs between censored patients and those who experienced the
event. The censored group has more patients with longer survival times


# Correlation analysis
colormap = plt.cm.Reds
plt.figure(figsize=(8, 8))
sns.heatmap(df.corr(), linewidths=0.1, vmax=0.8,
            square=True, cmap=colormap, linecolor='white')
plt.title('Correlation matrix', fontsize=14)
plt.show()
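A heatmap of more than 50 features can be hard to read, so the visual check can be complemented by listing the feature pairs whose absolute correlation exceeds a threshold. The helper below is a hypothetical sketch; the function name high_corr_pairs, the threshold of 0.8 (chosen here to mirror the vmax of the heatmap), and the toy data are assumptions.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.8):
    """Return feature pairs with |Pearson r| above the threshold."""
    corr = frame.corr().abs()
    # Keep the upper triangle only, so each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data: y is almost a multiple of x, while z is unrelated to both
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                    "y": [2.0, 4.0, 6.0, 8.0, 11.0],
                    "z": [1.0, -1.0, 1.0, -1.0, 1.0]})
pairs = high_corr_pairs(toy, threshold=0.8)
print(pairs)
```

An empty result for the real data frame would correspond to the observation that no highly correlated features were found.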


# Time distribution of death and censored events
num_censored = df.shape[0] - df["OS_STATUS"].sum()
print("%.1f%% of records are censored" % (num_censored / df.shape[0] * 100))

plt.figure(figsize=(10, 6))
val, bins, patches = plt.hist((df.query('OS_STATUS == 1')['OS_MONTHS'],
                               df.query('OS_STATUS == 0')['OS_MONTHS']),
                              bins=30, stacked=True)
_ = plt.legend(patches, ["Time of Death", "Time of Censored"])
plt.title("Time Distribution of Censored and Death Patients")


The rest of Experiment 2, including plotting the CPH model,
preparing and evaluating the ML algorithms, and interpreting the
results, was set up as previously done in Experiment 1 in Subheadings
3.4.3, 3.4.4, and 3.4.5. Since the code is the same as the one shown
in those subheadings, we do not repeat it below. However, the
complete notebook can be accessed at
https://github.com/Angione-Lab/survival_analysis_tutorial.


3.5.4 Plot Cox Proportional Hazards Model

Before fitting the CPH model (step 5 in Fig. 1), the data was
normalized by applying the min-max method (as shown in Subheading
3.4.3). Then, the log(HR) values were plotted as shown in
Fig. 25, and the statistical report, including HR with a 95%
confidence interval and log-rank p-values, was generated. Genes
LCN15, OTOS, and INSM2 were identified as the top three most
significant factors associated with a high probability of experiencing
the event of interest (i.e., death), while genes MATN1 and KPRP
were negatively associated with the death event (as shown by their
negative log(HR) values). As shown in Fig. 26, the overall C-index
of this model is 0.574, which indicates an acceptable predictive model.


3.5.5 Set Up and Evaluate Machine Learning Algorithms

After visualizing the results of the CPH model, we performed the
same steps as in Subheading 3.4.4 to build and evaluate the ML
models (step 6 in Fig. 1). The following steps were applied:


1. Data was split into training (80%) and testing (20%) sets.


2. Data was normalized using min-max normalization. Normalized
data was used to fit and train the four predictive algorithms.


3. Fivefold cross-validation with grid search was used to tune
the hyperparameters and select their optimal values.


4. The models were evaluated on the testing set, and the full
process (steps 1-3) was repeated 20 times to obtain the average
C-index.

5. Finally, patients in the testing set were ranked in descending
order based on their predicted risk scores and split into two
groups according to the median value. The two groups (high-risk
and low-risk) were compared for all four algorithms using
Kaplan-Meier curves and the log-rank test.
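Step 5 above can be sketched as follows: the predicted risk scores are split at their median, and a Kaplan-Meier (product-limit) survival curve is estimated for each group. This is a minimal, self-contained illustration and not the code of the notebook; the function kaplan_meier and the six made-up patients are hypothetical, and a higher score is assumed to mean a higher predicted risk.

```python
import numpy as np

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimator)."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    survival, curve = 1.0, []
    for t in np.unique(times[events]):
        at_risk = np.sum(times >= t)             # still under observation at t
        deaths = np.sum((times == t) & events)   # events occurring exactly at t
        survival *= 1.0 - deaths / at_risk
        curve.append((float(t), float(survival)))
    return curve

# Hypothetical test-set predictions: higher score = higher predicted risk
risk = np.array([0.9, 0.1, 0.7, 0.3, 0.8, 0.2])
months = np.array([12.0, 90.0, 30.0, 75.0, 20.0, 110.0])
died = np.array([1, 0, 1, 1, 1, 0])

# Median split into high-risk and low-risk groups
high_risk = risk > np.median(risk)
km_high = kaplan_meier(months[high_risk], died[high_risk])
km_low = kaplan_meier(months[~high_risk], died[~high_risk])
print(km_high)
print(km_low)
```

Censored patients contribute to the number at risk until their last follow-up but never trigger a drop in the curve, which is why the low-risk curve in this toy example changes only once.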


The code for setting up and evaluating the ML models is the same
as in Subheading 3.4.4. The outcomes of the four algorithms are
presented in Figs. 27 and 28. RSF had the highest average
C-index value of 0.53, followed by GBS, SSVM, and CPH.
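To help read the C-index values reported above, the following sketch computes Harrell's concordance index from scratch: among all comparable pairs of patients, it counts how often the patient who died earlier received the higher predicted risk. A value of 1.0 means perfect ranking and 0.5 corresponds to random ranking. This is a simplified illustration (pairs with tied survival times are ignored), not the routine used in the notebook; the function name and the toy data are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data (tied times are ignored)."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier death, higher predicted risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Hypothetical follow-up times (months), event flags, and predicted risks
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
perfect = concordance_index(times, events, [0.9, 0.7, 0.2, 0.4])
reversed_ = concordance_index(times, events, [0.1, 0.3, 0.8, 0.6])
constant = concordance_index(times, events, [0.5, 0.5, 0.5, 0.5])
print(perfect, reversed_, constant)
```

Seen through this definition, average C-indices between 0.51 and 0.53 mean that the models ranked patient pairs only slightly better than chance when using the 50 selected genes alone.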
Fig. 25 Results of the CPH model applied to transcriptomic data. The genes LCN15, OTOS, and INSM2 were
found as the top three most significant factors associated with low survival with HR values of 1.643, 1.556,
and 1.507, respectively. Hence, patients having higher values of these three predictors are more likely to have
a shorter survival time. In contrast, the predictors with a log(HR) value less than zero (i.e., HR less than one),
such as MATN1 and KPRP, were negatively associated with the death event. Patients with higher values of these
genes tend to live longer compared to those who have lower expression values of the same genes


Fig. 26 Cox proportional hazards report for transcriptomic data. The report indicates that OS_MONTHS was the
duration variable, while OS_STATUS was the event variable used for survival analysis. The figure also reports
the HR values (exp(coef)), with the corresponding 95% confidence interval, and p-values of the 50 extracted
features. The predictive accuracy of the CPH model, i.e., the C-index, was 0.574, which indicates an
acceptable model. Similar to the results presented in Fig. 25, the genes LCN15, OTOS, and INSM2 were
identified as the top three most significant factors associated with the death event with a p-value less than
0.05, and coefficient/log(HR) values of 1.643, 1.556, and 1.507, respectively


The Kaplan-Meier curves for breast cancer patients in the
testing set, stratified by their predicted prognostic score using
50 features, revealed that only the CPH model reported a significant
difference in the survival distributions of high-risk and low-risk
patients, with a p-value of 0.009, as shown in Fig. 29. This result
might be due to the number of features selected in the preprocessing
steps. For this reason, several approaches have been proposed
to select the optimal subset of features and achieve more accurate
and robust results [77-79].


3.5.6 Interpret Model

In order to interpret the results of the models (step 7 in Fig. 1),
we computed and plotted the SHAP values for all the features in the
test set. The SHAP values for the top 20 features are shown in
Fig. 30. Each data point represents a single patient. The y-axis
lists the top 20 most influential genes in descending order, each
represented by its Hugo symbol. The x-axis reports the corresponding
SHAP values for a specific observation in the testing set. The higher
the SHAP value, the higher the mortality risk of the patient
represented by the data point. The color represents low or high gene
expression values. In particular, the genes LCN15 and AA625691 were
identified among the top 10 features by the four models. OTOS was
selected by CPH, RSF, and SSVM. High values of this gene had a
positive impact on the models' outcome (i.e., high values of this
gene correlate with a higher risk of experiencing the event of
interest). All these genes were associated with patient survival and
could represent useful prognostic biomarkers for breast cancer
patients. Most of the gene features in the CPH model had SHAP values
clustered around 0, indicating no significant influence on the
outcome of the model.


Fig. 27 C-index comparisons for Experiment 2. Boxplots of C-index results of transcriptomic data using CPH,
RSF, GBS, and SSVM. The experiments were replicated 20 times. In each experiment, the data was randomly
divided into training and testing sets with a ratio of 80:20 while guaranteeing the same censoring percentage
in each set of data. RSF was found to have the highest median C-index, followed by GBS, SSVM, and CPH


3.6 Experiment 3: Integrating Clinical to Transcriptomic Data


In the last experiment presented in our tutorial, clinical information
(from Experiment 1) and transcriptomic data (from Experiment 2)
were integrated to improve the predictive power of the ML models.
The workflow followed in this experiment is similar to the one
previously followed in Subheadings 3.4 and 3.5. First, data was
loaded and cleaned before performing EDA. Next, the CPH results
were visualized, and the reports were extracted for further analysis.
Cox Regression
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model', CoxPHSurvivalAnalysis(alpha=0.001))])
C-index for test set (Hold out): 0.5157661773365008
Average C-index for 20 runs 0.5149500000000001

Random Forest Survival
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model',
                 RandomSurvivalForest(max_depth=8, max_features='sqrt',
                                      min_samples_leaf=50,
                                      min_samples_split=100, n_estimators=500,
                                      random_state=5))])
C-index for test set (Hold out): 0.5301415199691377
Average C-index for 20 runs 0.53135

Gradient Boosting Survival
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model',
                 GradientBoostingSurvivalAnalysis(n_estimators=1000,
                                                  random_state=5))])
C-index for test set (Hold out): 0.5120302125845161
Average C-index for 20 runs 0.52285

SVM Survival
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model',
                 FastSurvivalSVM(max_iter=500, optimizer='avltree',
                                 random_state=5))])
C-index for test set (Hold out): 0.4973096992954458
Average C-index for 20 runs 0.5169000000000001


Fig. 28 ML model results for 50 selected genes on the transcriptomic data. The selected hyperparameters,
initial test results, and the average C-index of each model are displayed in the output. Overall, the average
performance over 20 runs of the three ML-based models outperformed the CPH model on the analysis of
transcriptomic data for survival prediction. RSF had the highest average C-index with a value of 0.531,
followed by GBS, SSVM, and CPH


Then, we prepared the data for survival analysis and constructed the
ML models for training and performance evaluation.
Finally, the outcomes were interpreted to identify the important
markers associated with low survival.


3.6.1 Load Data

The data used for this experiment is derived from the data already
used for Experiment 1 (Subheading 3.4) and Experiment 2 (Subheading
3.5), i.e., the encoded clinical data and the 50 Hugo symbols
extracted using mRMR. There were 1977 and 1903 observations in the
preprocessed clinical data and transcriptomic data, respectively. We
extracted the matching observations between these two datasets and
used them in this experiment.
Fig. 29 Kaplan-Meier curves to compare high-risk and low-risk breast cancer groups, stratified by predicted
survival risk score based on the transcriptomic data when using 50 features. The high-risk group includes
patients with predicted risk scores above the median value, while the low-risk group comprises patients with
predicted risk scores lower than the median value. The p-value from the log-rank test was calculated to
statistically determine the difference in survival functions between the two groups. The figure shows that only
the CPH model showed a statistically significant difference between risk groups with a p-value of 0.009


# Load data
file1 = pd.read_csv('Data/clinical.csv')
file2 = pd.read_csv('Data/GeneID_MRMR_50.csv')

# Merge clinical data
data = pd.merge(file1, file2, how="inner", on=["PATIENT_ID"])
Fig. 30 SHAP summary plot for transcriptomic data for (a) CPH, (b) RSF, (c) GBS, and (d) SSVM models. For
each gene feature, each data point represents a single patient. The y-axis lists the top 20 prognostic
biomarkers and presents them in descending order based on the ranking provided by the mean of their
absolute SHAP values. The x-axis reports the SHAP value indicating the impact of the feature on the prediction
of the algorithm for a specific observation in the testing set. The color represents the value of the feature for
each patient. The higher the SHAP value of a patient, the higher the risk of death. The genes OTOS and
AA625691 were respectively found as the most important predictors for the CPH and SSVM models, while
LCN15 was identified as the most significant feature for the RSF and GBS models
Then, an overview of the information and the first five rows of
the merged data frame were displayed. As shown in Fig. 31, there
were 79 columns and 1903 rows in the new data frame. No missing
values were found in the final dataset.


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1903 entries, 0 to 1902
Data columns (total 79 columns):
 #   Column                          Non-Null Count  Dtype
 0   CELLULARITY                     1903 non-null   float64
 1   HER2_SNP6                       1903 non-null   float64
 2   INFERRED_MENOPAUSAL_STATE       1903 non-null   float64
 3   INTCLUST                        1903 non-null   float64
 4   THREEGENE                       1903 non-null   float64
 5   CHEMOTHERAPY                    1903 non-null   float64
 6   ER_IHC                          1903 non-null   float64
 7   HORMONE_THERAPY                 1903 non-null   float64
 8   CLAUDIN_SUBTYPE                 1903 non-null   float64
 9   LATERALITY                      1903 non-null   float64
 10  RADIO_THERAPY                   1903 non-null   float64
 11  HISTOLOGICAL_SUBTYPE            1903 non-null   float64
 12  BREAST_SURGERY                  1903 non-null   float64
 13  CANCER_TYPE_DETAILED            1903 non-null   float64
 14  ER_STATUS                       1903 non-null   float64
 15  HER2_STATUS                     1903 non-null   float64
 16  ONCOTREE_CODE                   1903 non-null   float64
 17  PR_STATUS                       1903 non-null   float64
 18  LYMPH_NODES_EXAMINED_POSITIVE   1903 non-null   float64
 19  NPI                             1903 non-null   float64
 20  AGE_AT_DIAGNOSIS                1903 non-null   float64
 21  COHORT                          1903 non-null   float64
 22  GRADE                           1903 non-null   float64
 23  TUMOR_SIZE                      1903 non-null   float64
 24  TUMOR_STAGE                     1903 non-null   float64

Fig. 31 Output of the integrated clinical and transcriptomic data information. The output gives an overview of 
the merged data frame, including total entries, data types, names of columns, and the number of validated 
data points. There are 1903 entries and 79 columns in the merged data frame. The first 25 features are shown 
in this figure. There are no missing values in the final dataset 
The number of duplicate values in data 0
After the first preprocessing, the shape of data is (1903, 78)
Missing value number: 0


Fig. 32 Output of the integrated clinical and transcriptomic data preprocessing. After removing the PATIENT_ID
column, the new dataset consists of 1903 rows and 78 columns. There are no duplicates or missing values
in the final dataset


# Have a quick look at data
data.info()
data.head()


3.6.2 Preprocess and Explore Data

For this experiment, checking for duplicate and missing values is
optional since the data was already processed in the previous
two experiments. However, it is always good practice to conduct
the preprocessing step after loading data (step 3 in Fig. 1). The
PATIENT_ID column was removed from the dataset before
moving to the next step of the pipeline. As shown in Fig. 32,
the new shape of the data was 1903 rows and 78 columns. No
duplicates or missing values were found in the final dataset.


# Preprocess data & Explore data
# Check duplicate values
print('The number of duplicate values in data', data.duplicated().sum())

# Drop unused cols: based on data.info(), we drop unused and null cols
drop_list = ['PATIENT_ID']
df = data.drop(drop_list, axis=1)
print('After the first preprocessing, the shape of data is', df.shape)

# Check missing values again
print('Missing value number:', df.isna().sum().sum())


Next, the correlation matrix was plotted to provide more
insights into the relationships between features in the merged
dataset (Fig. 33). Except for some pairs of clinical features such as
ER_STATUS and ER_IHC, no high correlations were observed
between clinical and transcriptomic features.


# Correlation analysis
colormap = plt.cm.Reds
plt.figure(figsize=(15, 15))
sns.heatmap(df.corr(), linewidths=0.1, vmax=0.8,
            square=True, cmap=colormap, linecolor='white')
plt.title('Correlation matrix', fontsize=14)
plt.show()
Fig. 34 Distribution of follow-up times of censored and observed (death) events in the integrated clinical and
transcriptomic data. In the data, 42% of the observations are censored. The distribution is right-skewed,
and it differs between censored patients and those who experienced an event. The censored group has
more patients with longer survival times


In the next step, the follow-up survival-time distribution was
plotted (Fig. 34). Overall, the time distribution plot for this
experiment was similar to the one observed in Experiment 2.
In the integrated clinical and transcriptomic data, 42% of the
observations were censored.


# Time distribution of death and censored events
num_censored = df.shape[0] - df["OS_STATUS"].sum()
print("%.1f%% of records are censored" % (num_censored / df.shape[0] * 100))

plt.figure(figsize=(10, 6))
val, bins, patches = plt.hist((df.query('OS_STATUS == 1')['OS_MONTHS'],
                               df.query('OS_STATUS == 0')['OS_MONTHS']),
                              bins=30, stacked=True)
_ = plt.legend(patches, ["Time of Death", "Time of Censored"])
plt.title("Time Distribution of Censored and Death Patients")


The rest of Experiment 3 includes plotting the CPH model,
preparing and evaluating the ML algorithms, and interpreting the
results. These were set up as previously done in Experiment 1 in
Subheadings 3.4.3, 3.4.4, and 3.4.5. Hence, we do not repeat the
code in the sections below, but we discuss and interpret the results.
However, the complete notebook can be accessed at
https://github.com/Angione-Lab/survival_analysis_tutorial.


3.6.3 Plot Cox Proportional Hazards Model


3.6.4 Set Up and Evaluate Machine Learning Algorithms


3.6.5 Interpret Model


Following the same approach presented in Subheading 3.4.3 in
Experiment 1 (step 5 in Fig. 1), the merged data was normalized,
and the CPH model was fitted to generate the log(HR)s and the
final statistical report. Figure 35 shows that AGE_AT_DIAGNOSIS
was identified as the most significant factor associated with a
higher probability of experiencing the event, with a log(HR) value
of 3.800. In contrast, the genes ECEL1 and KPRP were found to be
negatively associated with the death event, as shown by their
negative log(HR) values. Patients having higher expression values for
these two genes tend to live longer compared to those who show
lower expression values of the same genes.
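As a reading aid for the log(HR) values, a log hazard ratio is converted to a hazard ratio by exponentiation. Assuming that the reported value is the fitted Cox coefficient and that the feature was min-max normalized to the [0, 1] range before fitting, the resulting hazard ratio compares a patient at the maximum observed value of the feature with one at the minimum, all other features being equal. The short sketch below performs this conversion for the value reported for AGE_AT_DIAGNOSIS.

```python
import math

# log(HR) reported for AGE_AT_DIAGNOSIS in Fig. 35
log_hr = 3.800

# HR = exp(coefficient); with min-max scaled inputs, this compares the
# largest observed value of the feature with the smallest one
hazard_ratio = math.exp(log_hr)
print(f"HR = {hazard_ratio:.1f}")
```

Conversely, a negative log(HR) always maps to a hazard ratio between 0 and 1, which is why such predictors are described as negatively associated with the death event.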


Following the same steps (step 6 in Fig. 1) described in Subheading
3.4.4, the experiment pipeline was the same as in Experiments 1 and
2. First, the data was split into training and testing sets. The ML
models were then trained using a grid search approach with fivefold
cross-validation to identify the optimal hyperparameters. Next, the
fitted models were evaluated on the testing set. The splitting,
training, and evaluation were repeated 20 times to obtain an average
C-index. Finally, Kaplan-Meier curves and log-rank tests were applied
to statistically compare the two predicted risk groups.
The high-risk group contained the patients in the testing set with
predicted risk scores above the median value. In contrast, the
low-risk group included the patients with predicted risk scores
below the median value. This process helped us to identify the
optimal algorithm for estimating the survival risk of cancer
patients.
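The log-rank comparison of the two risk groups can be illustrated with a small from-scratch implementation. At each distinct death time, the test compares the number of deaths observed in one group with the number expected if both groups shared the same hazard; the resulting statistic follows approximately a chi-square distribution with one degree of freedom. This sketch is for intuition only and is not the routine used in the notebook; the function name and the toy follow-up times are hypothetical.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    pooled = ([(t, e, 0) for t, e in zip(times_a, events_a)]
              + [(t, e, 1) for t, e in zip(times_b, events_b)])
    observed_a = expected_a = variance = 0.0
    for t in sorted({t for t, e, _ in pooled if e}):
        n = sum(1 for ti, _, _ in pooled if ti >= t)               # at risk overall
        n_a = sum(1 for ti, _, g in pooled if ti >= t and g == 0)  # at risk, group A
        d = sum(1 for ti, e, _ in pooled if ti == t and e)         # deaths at t
        d_a = sum(1 for ti, e, g in pooled if ti == t and e and g == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("log-rank statistic is undefined for these data")
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical groups: every high-risk patient dies before any low-risk one
chi2_sep, p_sep = logrank_test([1, 2, 3, 4, 5], [1] * 5,
                               [6, 7, 8, 9, 10], [1] * 5)
# Two identical groups should show no difference at all
chi2_same, p_same = logrank_test([1, 2, 3], [1] * 3, [1, 2, 3], [1] * 3)
print(p_sep, p_same)
```

A small p-value, as in the first toy comparison, indicates that the median split of the predicted risk scores separates patients with genuinely different survival distributions.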

Figures 36 and 37 show the results of Experiment 3. RSF was
the best performing model with an average C-index value of 0.683,
followed by GBS, SSVM, and CPH, with C-indices equal to 0.675,
0.673, and 0.670, respectively. Kaplan-Meier curves, reported in
Fig. 38, show that all four models had statistically significant
differences in survival distributions between risk groups. RSF was
again the best survival algorithm with the lowest p-value of 1.197E-14.
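The averages above come from 20 repeated 80:20 splits in which the censoring percentage is kept equal in the training and testing sets. One way to obtain such splits is to sample the testing set separately within the censored and the deceased patients, as in the numpy sketch below. The helper stratified_split, its parameters, and the toy event vector are hypothetical and only illustrate the idea; the notebook may rely on a library routine instead.

```python
import numpy as np

def stratified_split(events, test_fraction=0.2, seed=0):
    """Split row indices so that both sets keep the same event/censoring ratio."""
    rng = np.random.default_rng(seed)
    events = np.asarray(events)
    test_idx = []
    for status in np.unique(events):           # censored (0) and deceased (1)
        idx = np.flatnonzero(events == status)
        rng.shuffle(idx)
        n_test = int(round(test_fraction * len(idx)))
        test_idx.extend(idx[:n_test])
    test_idx = np.sort(np.array(test_idx, dtype=int))
    train_idx = np.setdiff1d(np.arange(len(events)), test_idx)
    return train_idx, test_idx

# Hypothetical cohort: 60 deceased and 40 censored patients
events = np.array([1] * 60 + [0] * 40)

# Repeating the split with different seeds mimics the 20 replicated runs
splits = [stratified_split(events, test_fraction=0.2, seed=s) for s in range(20)]
train_idx, test_idx = splits[0]
print(len(train_idx), len(test_idx), events[test_idx].mean())
```

Keeping the censoring percentage fixed across runs makes the 20 C-index values more comparable, since no split ends up with an unusually small number of observed events.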


Fig. 35 Results of the CPH model applied to the merged clinical and transcriptomic dataset. AGE_AT_DIAGNOSIS
was identified as the most significant factor associated with the death event, with a log(HR) value of
3.800, followed by LYMPH_NODES_EXAMINED_POSITIVE and genes KRT1, OTOS, and ACTC1. In contrast,
predictors with negative log(HR) values, such as genes ECEL1 and KPRP, were negatively associated with the
death event. Patients with higher values of these factors tend to live longer compared to those who have
lower expression values of those genes


Fig. 36 C-index comparisons for Experiment 3. Boxplots of C-index results of the integrated clinical and
transcriptomic data using CPH, RSF, GBS, and SSVM. The experiments were replicated 20 times. In each
experiment, the data was randomly divided into training and testing sets with a ratio of 80:20 while
guaranteeing the same censoring percentage in each split. RSF was found to have the
highest median C-index, followed by GBS, SSVM, and CPH


For the final step of the pipeline (step 7 in Fig. 1), SHAP values
were used to interpret the results of the models. The SHAP plots of
the top 20 most important features are reported in Fig. 39. For each
feature, each data point represents a single patient. The y-axis
lists the top 20 most influential features in descending order, while
the x-axis reports their corresponding SHAP values for a specific
observation in the testing set. The higher the SHAP value associated
with a patient, the higher the mortality risk of that patient.
AGE_AT_DIAGNOSIS was selected among the top
features by the four models, suggesting the high impact of this
feature on the survival outcome. Figure 39 also identifies some
critical biomarkers affecting the prediction outcomes of the
algorithms, including genes ERAS, SLC14A1, and LCN15. Specifically,
high values of ERAS and SLC14A1 had a negative impact on the outcome
of the models (i.e., high expression values of these genes
correlated negatively with the probability of experiencing the
event), while high values of LCN15 showed a positive impact on
the outcome of the models (i.e., high values of this gene correlated
positively with the probability of experiencing the event). All these
genes were associated with patient survival and could be useful
prognostic biomarkers for breast cancer patients.
Cox Regression
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model', CoxPHSurvivalAnalysis(alpha=1))])
C-index for test set (Hold out): 0.6594013249366157
Average C-index for 20 runs 0.66945

Random Forest Survival
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model',
                 RandomSurvivalForest(max_depth=8, max_features='sqrt',
                                      min_samples_leaf=50,
                                      min_samples_split=100, n_estimators=500,
                                      random_state=5))])
C-index for test set (Hold out): 0.6730187290422834
Average C-index for 20 runs 0.6833

Gradient Boosting Survival
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model',
                 GradientBoostingSurvivalAnalysis(learning_rate=0.01,
                                                  n_estimators=800,
                                                  random_state=5))])
C-index for test set (Hold out): 0.6474809847059786
Average C-index for 20 runs 0.67505

SVM Survival
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model',
                 FastSurvivalSVM(max_iter=500, optimizer='avltree',
                                 random_state=5))])
C-index for test set (Hold out): 0.6683978081295494
Average C-index for 20 runs 0.6729499999999999


Fig. 37 ML model results for the integrated clinical and transcriptomic data. The selected hyperparameters,
initial test result, and average C-index of each model are displayed in the output. Overall, the average
performance over 20 runs of the three ML models outperformed CPH when using the integrated data for
survival prediction. RSF had the highest average C-index with a value of 0.683, followed by GBS, SSVM,
and CPH


Fig. 38 Kaplan-Meier curves to compare high-risk and low-risk groups, stratified by predicted survival risk
scores. The low-risk group (n = 190) included patients with predicted risk scores below the median value,
while the high-risk group (n = 190) comprised patients with risk scores above the median value. The p-value
from the log-rank test was calculated to determine the statistical
difference between the survival distributions of the two groups. The figure shows statistically significant
differences between survival groups for all four models with a p-value lower than 0.0001


4 Conclusions


The application of ML models to integrated clinical and
omics data can provide insights to improve personalized treatment
and precision oncology. However, there are still some challenges
to overcome, mainly related to the high dimensionality of
the data and the heterogeneity of samples. Hence, better
approaches to develop accurate predictive models and identify
critical prognostic markers need to be implemented. In this tutorial,
we showed that ML models are salient and successful methods to
analyze medical data and predict patient-specific survival outcomes.
Our study proposed a step-by-step protocol to design and evaluate
the traditional statistical model CPH and three ML models for
breast cancer survival, i.e., RSF, GBS, and SSVM. The performance
of the ML models was assessed using the METABRIC dataset. The
presented pipeline, based on optimizing the C-index by using a grid
search approach and a fivefold cross-validation method, has great
potential to improve the performance of models and generalize the

Machine Learning Methods for Survival Analysis of Breast Cancer 389 


(a) CPH (b) RSF 
a epeenpegdpnamee cen o-——-- - ———— 
Hi--« @&- & --—- -— 
a i .- _- ——= 
i. — @- 
vel —- + 
-otfte - 
ts = 
ie = 
i “ll R Sd 
> wn ” 
> - 
> sh 
5 ® 
+> e 
— sd 
ee b 
+ s 
+ e 
(c) GBS (d) SSVM 
f DIE tet eles aes ao saat 
-o—--—- ez 
— -—- Lt 
— 1. 
+> wee 
| tte 
qe ot. 
oa ell 
t-: Ls 
7 nu 
ode i eral 
AAS —+>--- afore 
° 4 wo 
+ n “tl 
® > 
+: - > 
sia > 
' —_ 
aes = 
+ + 


pact on model output) 


Fig. 39 SHAP summary plot for the integrated clinical and transcriptomic data for (a) CPH, (b) RSF, (c) GBS, 
and (d) SSVM models. For each gene feature, a single patient is represented by each data point. The y-axis 
lists the top prognostic biomarkers and presents them in descending order based on their ranking provided by 
the mean of their absolute SHAP values. The x-axis reports the SHAP value indicating the impact of the feature 
on the algorithm’s prediction outcome for a specific observation in the testing set. The color represents the 
value of the feature for each instance. The higher the SHAP value associated with the patient, the higher the 
risk of death. For the survival risk predictors, AGE_AT_DIAGNOSIS was consistently selected by the four 
models as the top significant factors impacting the outcome of models. Specifically, high values of this feature 
correlated with a higher probability of experiencing the event. Other biomarkers such as ERAS, SLC74A7, and 
LCN75 were identified as features having a high impact on the prediction outcomes and associated with the 
predicted survival likelihood 


models for survival prediction on unseen data. Furthermore, we 
used SHAP values to interpret the model results and identify the 
features that had the highest impact on the prediction outcomes of 
models. The improvement in ML interpretability will help research- 
ers and clinicians understand more about ML models and thus gain 
more credibility and trust. This tutorial represents one step further 
to bring these novel solutions to clinicians and to the public. Our 


390 Le Minh Thao Doan et al. 


Acknowledgements 


work offers an exploratory strategy to enhance the biological 
understanding of the prognosis predictive ML models. 

We conducted three different experiments for clinical data, 
transcriptomic data, as well as the integration of these two data 
types. Incorporating clinical and mRNA expression data is crucial 
to uncover a sequence of complicated interactions in multiple 
biological processes and complex human conditions. Due to the 
high-dimensional nature of transcriptomic data, mRMR was 
applied as a feature selection technique. This preprocessing step 
also helps to boost the performance of models, save computational 
resources, and reduce overfitting. 

Even if we presented the most used ML techniques to perform 
survival analysis on different types of data, there are some limita- 
tions to this tutorial. We only considered three ML algorithms, 
namely RSF, GBS, and SSVM, because of their popularity and 
effectiveness in analyzing survival data. However, other approaches 
based on deep learning, a branch of ML, have also proved their 
capability to work with survival data. Some packages are available to 
run deep-learning-based models for cancer prognosis, such as 
DeepSurv [80], Cox-nnet [43], and DeepProg [81]. A competitive 
performance comparison between our approaches to other deep- 
learning-based models could enable researchers to explore and 
obtain optimal ways to supplement conventional survival analysis 
techniques. 

The number of features selected in our study could also have 
limited the findings when using transcriptomic data. To save com- 
putation time and resources, we only extracted 50 features to 
demonstrate our approach. Future studies could adopt our frame- 
work and repeat our steps exploring different numbers of features. 

In summary, by performing survival analysis across different 
models and data, our results revealed that ML approaches were 
capable of generating accurate prognostic predictions. The 
ML-based models showed a better performance compared to tradi- 
tional statistic methods, i.e., CPH model. Particularly, RSF 
reported the best performance results in analyzing the transcrip- 
tomic data (Experiment 2) and the integrated clinical and transcrip- 
tomic data (Experiment 3), while SSVM was the best performing 
model when using clinical data only (Experiment 1). 
Machine Learning Using Neural Networks for Metabolomic 
Pathway Analyses 


Rosalin Bonetta Valentino, Jean-Paul Ebejer, and Gianluca Valentino 


Abstract 


Elucidating the mechanisms of metabolic pathways helps us understand the cascade of enzyme-catalyzed 
reactions that lead to the conversion of substances into final products. This has implications for predicting 
how newly synthesized compounds will affect a person’s metabolism and, hence, the development of novel 
treatments to improve one’s health. The study of metabolomic pathways, together with protein engineering, 
may also aid in the extraction, at scale, of natural products to be used as drugs and drug precursors. 
Several approaches have been used to correlate protein annotations to metabolic pathways in order to derive 
pathways directly related to specific organisms. These could range from association rule-mining techniques 
to machine learning methods such as decision trees, naive Bayes, logistic regression, and ensemble methods. 
In this chapter, we will be reviewing the use of machine learning for metabolic pathway analyses, with a 
step-by-step focus on the use of deep learning to predict the association of compounds (metabolites) to 
their respective metabolomic pathway classes. This prediction could help explain interactions of small 
molecules in organisms. Inspired by the work of Baranwal et al. (2019), we demonstrate how to build 
and train a deep learning neural network model to perform a multi-label prediction. We considered two 
different types of fingerprints as features (inputs to the model). The output of the model is the set of 
metabolic pathway classes (from the KEGG dataset) in which the input molecule participates. We will walk 
through the various steps of this process, including data collection, feature engineering, model selection, 
training, and evaluation. This model-building and evaluation process may be easily transferred to other 
domains of interest. All the source code used in this chapter is made publicly available at 
https://github.com/jp-um/machine_learning_for_metabolomic_pathway_analyses. 


Key words Metabolomics, Machine learning, Neural networks, KEGG classes, Feature engineering, 
Performance metrics 


1 Introduction 


1.1 Metabolomics 

Metabolomics involves the extensive analysis of metabolites 
consisting of small molecules (<1 kDa) in an organism or a particular 
biological sample. Such analysis depends on the wealth of 
biochemical knowledge attained over the last decades [1]. 

This area of research focuses on the intermediates and products 
of metabolism. These include fatty acids, carbohydrates, 
nucleotides, amino acids, antioxidants, and vitamins among other 
compounds. The metabolome includes all the metabolites that are 
synthesized by a biological system and can be characterized at all 
biological levels including organelles, cells, tissues, and 
organisms [2]. 

Numerous factors have been identified to affect metabolite 
levels within tissues and biological fluids, making the metabolome 
as a whole susceptible to fluctuations determined by genetic and 
environmental factors, gut microflora, as well as enzyme activity. 
Therefore, metabolomics gives us an indication of cell and ulti- 
mately organism health [3, 4]. 

As opposed to genomics, transcriptomics, or proteomics, meta- 
bolomics aims to give us an explanation of the response of organ- 
isms to physiological and pathophysiological stimuli. Hence, 
interest in this field of research has risen exponentially in recent 
years. Metabolomics, therefore, allows us to develop an 
understanding of the effect that genetic variation, disease, treatment, 
or diet exerts on the metabolic state of organisms [5, 6]. The 
analytical methods used in metabolomics mainly include nuclear 
magnetic resonance (NMR) spectroscopy and mass spectrometry 
(MS) [7, 8]. Such techniques allow for the analysis 
of numerous small molecules in a sample, and this may involve the 
identification and quantitation of metabolites. 

The three main research approaches taken in metabolomics 
are metabolic fingerprinting, metabolite profiling, and 
targeted metabolomics. Metabolic fingerprinting is the rapid 
evaluation of the reproducible metabolite fingerprint of a biological 
sample. The metabolic fingerprint can be considered as the concen- 
tration of metabolites in a sample at a point in time. In this case, 
metabolite identification is not necessary. The aim of the fingerprint 
is to represent numerous compound classes which may be poten- 
tially interesting for applications such as drug discovery. The meta- 
bolites are not known in advance in this case. Metabolic 
fingerprinting does not require any advanced sample preparation 
or chromatographic resolution techniques. Instead it makes use of 
techniques which provide reproducible data. Metabolic fingerprint- 
ing is mostly used for classifying a sample rather than for quantita- 
tive analysis. This may be, for example, to distinguish between 
specimens in a healthy or disease state [9, 10]. 

Metabolite profiling consists of an approach in metabolomics 
which is non-targeted and includes analyzing a vast range of meta- 
bolites without knowing which compounds would be of interest in 
advance. As opposed to fingerprinting, the scope of profiling is to 
identify as well as quantify as many compounds of interest as 
possible via high-throughput metabolite quantification. The latter 
requires chromatographic separation at a high resolution coupled 
with mass spectrometry to enable the detection of new metabolic 
biomarkers [11, 12]. 


Targeted metabolomics consists of analyzing one or more 
predefined metabolites and usually allows for their 
identification and quantification within a sample. These compounds 
are selected in advance, depending on their particular metabolic 
pathways or biomarkers that would be related to a specific reaction in 
the organism. In this case, analytical techniques that provide a high 
sensitivity and selectivity are used to attain low detection limits of 
metabolites [11, 12]. 

Metabolomic research has a variety of health applications which 
range from toxicology, newborn screening, and pharmacology to 
clinical chemistry. Currently, metabolomics is employed to find new 
biomarkers of numerous diseases and to highlight the biochemical 
pathways contributing to their pathogenesis [1, 6, 7]. Besides 
identifying new diagnostic biomarkers which can be utilized to 
detect a disease at an early stage, metabolomic research can be 
applied to find biomarkers that can be used to select an appropriate 
therapy and subsequently evaluate the outcome of the particular 
treatment applied. Thus, metabolomics can be used as a tool in the 
development of personalized medicine. 

This chapter is structured as follows: we first introduce machine 
learning and a particular branch known as deep learning, as well as 
one of the most commonly used architectures, neural networks. 
Applications of deep learning to metabolomics in the literature are 
reviewed. The neural network training procedure which is 
performed through backpropagation is also covered. Through an 
illustrative example, we will then demonstrate how to build and train 
a deep learning neural network to predict the association of 
metabolites to their respective metabolomic pathway classes. Like many 
problems in biology, this is a multi-label problem and is well-suited 
to the use of neural networks. 


1.2 Machine Learning 

Machine learning is broadly defined as the capability of an algo- 
rithm to learn a model which can perform some task successfully 
given some performance metric. Supervised learning is a particular 
learning paradigm in which the algorithm is provided with a labeled 
dataset of correct input-output pairs and learns to predict the 
correct output given an input. Other learning paradigms include 
unsupervised learning, in which the ground truth output is not 
provided, and the algorithm therefore needs to discover some 
underlying pattern in the available data, as well as reinforcement 
learning, in which an agent explores an environment according to 
some policy in order to earn rewards. 

An artificial neural network (ANN) [13] is a type of machine 
learning model which can be trained using supervised learning. 
ANNs loosely mimic the behavior of a biological neural network, 
in which neurons are interconnected and are triggered depending 
on the input signal provided and an activation function. Weights are 
associated with each connection between two neurons. The 
network architecture follows a sequential structure, in which an 
input signal is propagated throughout the network in a 
feed-forward manner until it reaches the output. Deep learning [14] is 
a term used to describe the training of large neural networks which 
have many neurons and several hidden layers. The purpose of these 
hidden layers is to allow the network to learn nonlinear and 
convoluted mappings, as well as to extract features from the input data. 


1.3 ML Applications in Metabolomics 

Deep learning has been increasingly used for problems in 
metabolomics, which are difficult to solve with conventional algorithms. 
For example, in nuclear magnetic resonance (NMR) and mass 
spectrometry (MS)-based metabolomics, a variety of ML 
algorithms have been developed for data preprocessing, peak 
identification, peak integration, compound identification/quantification, 
data analysis, and data integration [15-19]. In particular, Baranwal 
et al. [20] use graph convolutional neural networks to extract 
molecular shape features which are then fed to a random forest 
classifier to predict the pathway class for a given molecule. 
Nevertheless, the number of deep learning-associated publications in 
metabolomics is still significantly lower than in all other omics [21]. 

The uptake of deep learning and neural networks is also 
increasing within the metabolomic community due to the availability of 
programming languages such as Python [22], R [23], and 
MATLAB [24] as well as frameworks such as TensorFlow [25], 
Keras [26], PyTorch [27], and scikit-learn [28]. Most of these 
frameworks can run on graphics processing units (GPUs), which 
can parallelize complex tasks (e.g., matrix multiplication) and are 
readily available in desktop computers and computing clusters. 


2 Methods 


2.1 Neural Network 

A neural network is represented using a connected graph of 
neurons. An example is shown in Fig. 1, where a neuron $Y$ receives 
inputs from neurons $X_1$, $X_2$, and $X_3$. The outputs from these three 
neurons are $x_1$, $x_2$, and $x_3$. The net input $y_{\mathrm{in}}$ to the neuron $Y$ is the 
sum of the weighted signals from these neurons: 
$y_{\mathrm{in}} = w_1 x_1 + w_2 x_2 + w_3 x_3$. Further neurons may then be connected 
to $Y$. The output from $Y$ is $y_{\mathrm{out}} = f(y_{\mathrm{in}})$, where $f$ is called the 
activation function. 

A set of commonly used activation functions is shown in Fig. 2. 
In order to achieve optimal performance, an activation function 
should have the following properties: (a) nonlinearity, 
(b) continuously differentiable, (c) monotonic, and 
(d) approximate identity around the origin. 
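
As a small worked illustration of the single neuron described above, the sketch below computes the weighted sum of three inputs and passes it through three activation functions. The input and weight values are made up for the example, and the choice of ReLU, tanh, and sigmoid is an assumption (the set plotted in Fig. 2 may differ).

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, activation):
    """Return f(y_in), where y_in is the weighted sum of the inputs."""
    y_in = sum(w * x for w, x in zip(weights, inputs))
    return activation(y_in)


x = [1.0, 2.0, 3.0]      # signals x1, x2, x3 (illustrative values)
w = [0.5, -0.25, 0.1]    # weights w1, w2, w3 (illustrative values)

# Here y_in = 0.5*1 - 0.25*2 + 0.1*3 = 0.3 (up to rounding).
outputs = {
    "relu": neuron_output(x, w, relu),
    "tanh": neuron_output(x, w, math.tanh),
    "sigmoid": neuron_output(x, w, sigmoid),
}
```

The same net input gives a different output for each activation function, which is why the activation function is treated as a design choice of the network.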
Fig. 1 An example of a simple neural network 

Fig. 2 Commonly used neural network activation functions 


2.2 Training a Neural 
Network 


The training process typically involves three steps: (a) feed-forward, 
(b) backpropagation, and (c) weight adjustment. The architecture 
in Fig. 3 shows a neural network with one hidden layer, with 
weights $v$ and $w$. Note the inclusion of a bias neuron in each layer 
(except the output layer), which has a constant input value of 
1. The purpose of the bias neuron is analogous to that of an 
intercept when fitting a line to some data: it allows the activation 
function to be shifted as needed. 

Fig. 3 Backpropagation neural network with one hidden layer 

The following is the algorithm which is used to train a neural 
network by backpropagating the error through the network to 
update its weights: 


Step 0: The weights are initialized to small random values (e.g., in 
the range $-1$ to $1$). 

Step 1: While the stopping condition is false, do Steps 2-9. 

Step 2: For each training pair, do Steps 3-8. 

Feedforward: 

Step 3: Each input unit ($X_i$, $i = 1, \ldots, n$) receives input signal $x_i$ and 
broadcasts this signal to all units in the layer above (the hidden 
units). 

Step 4: Each hidden unit ($Z_j$, $j = 1, \ldots, p$) sums its weighted input 
signals, 

$$z_{\mathrm{in},j} = v_{0j} + \sum_{i=1}^{n} x_i v_{ij}, \tag{1}$$

applies its activation function to compute its output signal, 

$$z_j = f(z_{\mathrm{in},j}), \tag{2}$$

and sends this signal to all units in the layer above (output units). 

Step 5: Each output unit ($Y_k$, $k = 1, \ldots, m$) sums its weighted input 
signals, 

$$y_{\mathrm{in},k} = w_{0k} + \sum_{j=1}^{p} z_j w_{jk}, \tag{3}$$

and applies its activation function to compute its output signal: 

$$y_k = f(y_{\mathrm{in},k}). \tag{4}$$

Backpropagation of error: 

Step 6: Each output unit ($Y_k$, $k = 1, \ldots, m$) receives a target pattern 
corresponding to the input training pattern, computes its error 
information term, 

$$\delta_k = (t_k - y_k) f'(y_{\mathrm{in},k}), \tag{5}$$

calculates its weight correction term (used to update $w_{jk}$ later), 

$$\Delta w_{jk} = \alpha \delta_k z_j, \tag{6}$$

where $\alpha$ is the learning rate (which determines the rate at which the 
weights are changed), calculates its bias correction term (used 
to update $w_{0k}$ later), 

$$\Delta w_{0k} = \alpha \delta_k, \tag{7}$$

and sends $\delta_k$ to units in the layer below. 

Step 7: Each hidden unit ($Z_j$, $j = 1, \ldots, p$) sums its delta inputs 
(from units in the layer above), 

$$\delta_{\mathrm{in},j} = \sum_{k=1}^{m} \delta_k w_{jk}, \tag{8}$$

multiplies by the derivative of its activation function to calculate its 
error information term, 

$$\delta_j = \delta_{\mathrm{in},j} f'(z_{\mathrm{in},j}), \tag{9}$$

calculates its weight correction term (used to update $v_{ij}$ later), 

$$\Delta v_{ij} = \alpha \delta_j x_i, \tag{10}$$

and calculates its bias correction term (used to update $v_{0j}$ later), 

$$\Delta v_{0j} = \alpha \delta_j. \tag{11}$$

Update weights and biases: 

Step 8. Each output unit ($Y_k$, $k = 1, \ldots, m$) updates its bias and 
weights ($j = 0, \ldots, p$): 

$$w_{jk} \leftarrow w_{jk} + \Delta w_{jk}. \tag{12}$$

Each hidden unit ($Z_j$, $j = 1, \ldots, p$) updates its bias and weights 
($i = 0, \ldots, n$): 

$$v_{ij} \leftarrow v_{ij} + \Delta v_{ij}. \tag{13}$$


Step 9. Test stopping condition. The stopping condition can be 
defined such that the training algorithm terminates once the 
change in the weights goes below a certain threshold, e.g., 
10-°. This ensures that the weights would have converged to 
some stable values. 
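
Steps 0-9 can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions rather than the code used in the chapter: the activation is taken to be the sigmoid, the stopping test uses the largest single weight change in an epoch, and the function names, learning rate, tolerance, epoch limit, and the toy OR dataset are all hypothetical.

```python
import math
import random


def f(x):
    """Sigmoid activation (an assumed choice)."""
    return 1.0 / (1.0 + math.exp(-x))


def f_prime(x):
    """Derivative of the sigmoid."""
    s = f(x)
    return s * (1.0 - s)


def train(pairs, n, p, m, alpha=0.5, tol=1e-6, max_epochs=5000, seed=0):
    rng = random.Random(seed)
    # Step 0: small random weights; row 0 of v and w holds the biases.
    v = [[rng.uniform(-1, 1) for _ in range(p)] for _ in range(n + 1)]
    w = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(p + 1)]
    for _ in range(max_epochs):                          # Step 1
        largest_change = 0.0
        for x, t in pairs:                               # Step 2
            # Steps 3-5: feedforward.
            z_in = [v[0][j] + sum(x[i] * v[i + 1][j] for i in range(n))
                    for j in range(p)]
            z = [f(a) for a in z_in]
            y_in = [w[0][k] + sum(z[j] * w[j + 1][k] for j in range(p))
                    for k in range(m)]
            y = [f(a) for a in y_in]
            # Step 6: error terms of the output units.
            delta_k = [(t[k] - y[k]) * f_prime(y_in[k]) for k in range(m)]
            # Step 7: error terms of the hidden units (weights not yet updated).
            delta_in = [sum(delta_k[k] * w[j + 1][k] for k in range(m))
                        for j in range(p)]
            delta_j = [delta_in[j] * f_prime(z_in[j]) for j in range(p)]
            # Step 8: update weights and biases.
            for k in range(m):
                for j in range(p + 1):
                    z_j = 1.0 if j == 0 else z[j - 1]
                    change = alpha * delta_k[k] * z_j
                    w[j][k] += change
                    largest_change = max(largest_change, abs(change))
            for j in range(p):
                for i in range(n + 1):
                    x_i = 1.0 if i == 0 else x[i - 1]
                    change = alpha * delta_j[j] * x_i
                    v[i][j] += change
                    largest_change = max(largest_change, abs(change))
        if largest_change < tol:                         # Step 9
            break
    return v, w


def predict(x, v, w):
    n, p, m = len(v) - 1, len(v[0]), len(w[0])
    z = [f(v[0][j] + sum(x[i] * v[i + 1][j] for i in range(n)))
         for j in range(p)]
    return [f(w[0][k] + sum(z[j] * w[j + 1][k] for j in range(p)))
            for k in range(m)]


# Toy training set: the logical OR of two binary inputs.
or_pairs = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [1])]
v, w = train(or_pairs, n=2, p=2, m=1)
outputs = [predict(x, v, w)[0] for x, _ in or_pairs]
```

The weights are updated after every training pair, as in Step 8; frameworks such as Keras usually accumulate the corrections over a mini-batch instead.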
3 Illustrative Example 


3.1 Dataset 

In order to train a neural network to predict the constituent 
pathway class(es) for a given metabolite, we obtained a dataset of 6669 
metabolites from Baranwal et al. [20]. This dataset was assembled 
from the Kyoto Encyclopedia of Genes and Genomes (KEGG) 
database [29]. Each metabolite is labeled as belonging to one or 
more classes, summarized in Table 1. 
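
Because a metabolite may belong to several classes at once, its label is naturally stored as a binary vector with one position per class. A minimal sketch follows, assuming the 11 classes are numbered 1 to 11 in the order of Table 1; the helper name and the example annotation are hypothetical.

```python
NUM_CLASSES = 11  # number of pathway classes listed in Table 1


def multi_hot(class_ids, num_classes=NUM_CLASSES):
    """Encode a set of 1-based class IDs as a binary target vector."""
    target = [0] * num_classes
    for class_id in class_ids:
        target[class_id - 1] = 1
    return target


# Hypothetical metabolite annotated with lipid metabolism (class 3)
# and metabolism of cofactors and vitamins (class 8).
y = multi_hot({3, 8})
```

A multi-label model then predicts all 11 positions of this vector for each input molecule.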


3.2 Dataset Preparation and Feature Engineering 

The dataset consisting of the metabolites with their classes is 
passed through a preprocessing procedure as shown in Fig. 4. 
This preparation process is crucial for building effective 
machine learning models. It consists of the following three steps: 


1. Standardization: The metabolites may originate from different 
sources. This implies that the molecules themselves may be 
represented in different ways (e.g., with/without salts, differ- 
ent protonation states, different tautomeric states, etc.). Mole- 
cules in the dataset should be represented in a standard 
manner; otherwise, the same molecular entity may give rise to 
different representations (and descriptors used in our models). 
This also removes molecules in our dataset that are of little 
interest (e.g., single-atom entries). 

2. Clustering: Some of the metabolites may be similar to each 
other and would artificially inflate the performance of the 
machine learning models. For example, two metabolites, one 
in the training and the other in the testing set having the same 
classes, may differ (only) by a methyl group. The machine 
learning algorithm would be able to easily classify the testing set 
molecule, since it has been trained using an almost identical 
molecule. In virtual screening, this is referred to as analogue bias 
[30]. Clustering is performed to remove these similar 
molecules (as well as identical ones). Only a representative molecule 
from each cluster is used as input to the machine learning 
models. 

3. Descriptor Generation: There are many ways in which a 
metabolite may be represented. This could be a vector of 
physicochemical properties (e.g., molecular weight, hydrogen bond 
donors, hydrogen bond acceptors, etc.), the topology of a 
molecule (e.g., the graph or fingerprint representation), 3D 
descriptors which use the shape (conformation) of a molecule 
for representation, and other higher-dimensional descriptors 
(e.g., 3D + charge). The representation used may have a severe 
impact on the accuracy of the model. A suitable representation 
is typically selected via experimentation. 

We implement these standardization, clustering, and descriptor 
generation steps using RDKit [31], a popular, free, and 
open-source cheminformatics toolkit. 


Table 1 
List of KEGG pathway database classes 

Class ID   Class name 
1          Carbohydrate metabolism 
2          Energy metabolism 
3          Lipid metabolism 
4          Nucleotide metabolism 
5          Amino acid metabolism 
6          Metabolism of other amino acids 
7          Glycan biosynthesis and metabolism 
8          Metabolism of cofactors and vitamins 
9          Metabolism of terpenoids and polyketides 
10         Biosynthesis of other secondary metabolites 
11         Xenobiotic biodegradation and metabolism 


[Figure: flowchart with the boxes Metabolites, Dataset Preparation, 
Descriptor Generation, and Model Building] 

Fig. 4 Data preprocessing procedure 


3.2.1 Standardization 

The main aim of our standardization process is twofold: (i) to 
enforce consistency across our dataset and results by representing 
all molecules in a uniform way and (ii) to make sure all molecules 
used to train and test the models are of high quality and error-free 
(e.g., free of incorrect valences). We start by loading the molecules 
from a SMILES file [32], and we sanitize each molecule using 
RDKit. Among other functionality, this step includes the following: 

1. Corrects a number of nonstandard valence states (e.g., 
N(=O)=O -> [N+](=O)[O-]). 

2. Calculates and checks explicit and implicit valences on all 
atoms. 

3. Converts aromatic rings to their Kekulé form. Errors are raised 
if a ring cannot be kekulized or if aromatic bonds are found 
outside rings. 

4. Identifies aromatic rings and ring systems (sets bond orders to 
aromatic). 

5. Identifies conjugation in a molecule. 

6. Removes chiral tags from atoms that are not sp3 hybridized. 

7. Adds explicit hydrogen atoms to preserve chemistry 
(e.g., in heteroatoms in aromatic rings). 

The output of this sanitization is a list of molecules which are 
consistent with each other and may be read into RDKit for 
processing. Each valid molecule is passed through the standardization 
process, which includes the following steps: 

1. Remove hydrogen atoms. 

2. Disconnect metal atoms. 

3. If many fragments are present, take the largest fragment. 

4. Normalize functional groups. 

5. Neutralize charges on the molecule, and then reionize 
common functional groups in a standard way (note that RDKit 
does not have a pKa calculator and no attempt is made at 
ionization at some pH). 

6. Canonicalize the tautomeric representation. 
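
The standardization steps listed above can be approximated with RDKit's `rdMolStandardize` module, as in the sketch below. This is an assumption about how such a pipeline might look, not the chapter's exact code: the function name and the example SMILES are illustrative, and `Cleanup` bundles several of the listed steps (hydrogen removal, metal disconnection, normalization, reionization) into a single call.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardize(smiles):
    """Return a standardized canonical SMILES, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)  # parsing also sanitizes the molecule
    if mol is None:
        return None
    # Removes hydrogens, disconnects metals, normalizes functional groups,
    # and reionizes the molecule.
    mol = rdMolStandardize.Cleanup(mol)
    # Keep only the largest fragment (drops counter-ions and solvents).
    mol = rdMolStandardize.LargestFragmentChooser().choose(mol)
    # Neutralize charges where possible, then reionize in a standard way.
    mol = rdMolStandardize.Uncharger().uncharge(mol)
    mol = rdMolStandardize.Reionizer().reionize(mol)
    # Select a canonical tautomer.
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)
    return Chem.MolToSmiles(mol)


# Sodium acetate is reduced to neutral acetic acid.
result = standardize("CC(=O)[O-].[Na+]")
```

Applying the same function to every entry gives each molecule a single representation, regardless of how it was drawn in its original source.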
The original dataset contains a number of molecules with 
dummy atoms (denoted with * in SMILES). We replace these 
dummy atoms with a hydrogen atom. We also remove all molecules 
with fewer than five atoms. 


3.2.2 Clustering Analysis 

Clustering allows us to take representative molecules from our 
dataset. This ensures that the performance of our models is 
evaluated in a more realistic way. Our clustering removes similar (and 
identical) molecules which are found in the dataset. This makes the 
performance evaluation more objective, since the testing set will not 
contain molecules that are trivially similar to those in the training 
set. In this study, we use Butina clustering [33] on the small 
molecules and take the centroid of each cluster while discarding the 
other members of each cluster. This reduced our dataset from 6669 
to 2171 molecules. There are three steps to clustering analysis using 
Butina: 


1. Generation of fingerprints 
2. Identification of potential cluster centroids 


3. Clustering based on exclusion spheres 
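
The three steps can be sketched in plain Python, with each fingerprint represented as the set of its on-bit positions. This is an illustrative reimplementation, not the chapter's code: the function names and the five toy fingerprints are made up, and ties in neighbor count are broken by input order as an assumption.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def butina(fingerprints, threshold):
    """Cluster fingerprints; the first member of each cluster is its centroid."""
    n = len(fingerprints)
    # Neighbors of each molecule: all others at or above the threshold.
    neighbors = [
        [j for j in range(n)
         if j != i and tanimoto(fingerprints[i], fingerprints[j]) >= threshold]
        for i in range(n)
    ]
    # Candidate centroids, most neighbors first (input order breaks ties).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue  # already inside another molecule's exclusion sphere
        members = [centroid] + [j for j in neighbors[centroid]
                                if j not in assigned]
        assigned.update(members)
        clusters.append(members)
    return clusters


# Five toy fingerprints (hypothetical on-bit sets).
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {10, 11, 12}, {10, 11, 13}]
clusters = butina(fps, threshold=0.4)
representatives = [c[0] for c in clusters]
```

Only the molecules in `representatives` would be kept for model building; the other cluster members are discarded.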


Generation of fingerprints In the original work by Darko Butina, 
a 1024-bit fingerprint was generated using Daylight. To make the 
work reproducible using open-source software, we decided to 
generate 1024-bit Morgan fingerprints with a radius of 2 (roughly 
equivalent to ECFP4) using RDKit. 


Identification of potential cluster centroids The idea is that the 
molecules in a cluster with the largest number of neighbors (i.e., 
similar molecules) are most representative of the cluster as they are 
most like other members in their group. To compute the potential 
centroids, a similarity threshold is chosen (in our case 0.4). This 
threshold is then used to compute neighbors in the set (anything 
more similar than this threshold is considered as a neighbor). A 
sorted list of molecules, by descending number of neighbors, in the 
set is maintained. This is required for the algorithm to be deter- 
ministic (i.e., gives the same results every time it is run). 


Clustering based on exclusion spheres Starting from the first mole- 
cule in the list (i.e., the one with most neighbors), calculate its 
similarity to all other molecules in the set in a pairwise fashion. All 
those molecules with a similarity equal to or higher than the 
selected threshold become members of the same cluster (with the 
original molecule from the sorted list being the centroid of the 
cluster). This is known as an exclusion sphere (on the known 
cluster). This set of similar molecules forming a cluster is now 
ignored and cannot form part of another cluster (or act as a cen- 
troid). If a molecule in the sorted list has no similar neighbors 
(either all molecules have a similarity lower than the chosen 
threshold or the similar molecules have been earlier assigned to 
another cluster), then it forms a singleton (a cluster with a single 
member). The Butina algorithm is known to generate many 
homogeneous clusters. 


3.2.3 Descriptor Generation 

In order to build supervised machine learning models, we need to 
be able to represent molecules in some computer-readable way. 
These representations, or “descriptors,” describe molecular 
properties using a set of numerical or categorical variables. There are many 
possible descriptors (e.g., Ultrafast Shape Recognition uses the 3D 
shape of a molecule to generate a vector of 12 real numbers 
[33, 34]), and different types of descriptors may affect the 
performance of our models. A common approach is to use a list 
(or vector) of numbers as a fingerprint to represent a molecule. 
Fingerprints may either be binary in nature, containing a zero or 
one in every position recording the absence or presence of a feature 
in a molecule, or else have numerical representations (such as 
counts of a particular chemical moiety). In this work, we use two 
different fingerprint descriptors: extended-connectivity circular 
fingerprint (ECFP) [35] and molecular access system (MACCS) 
[36]. ECFP assigns each non-hydrogen atom in the molecule an 
identifier based on six atomic properties (valence, number of 
immediate non-hydrogen atoms, etc.). A radius parameter (2 in our case) 
is specified during the fingerprint generation which defines the 
atomic neighborhood to consider in an iterative manner. Each 
iteration captures a larger atomic neighborhood for each atom. 
These environments are then hashed into a fixed-length binary list 
which records their presence. The length of our ECFP fingerprint is 
1024 bits. MACCS fingerprints are composed of an ordered binary 
list of 166 structural keys (e.g., does the molecule contain a 
Cl atom?). We generate these two fingerprints for each of our 
2171 metabolites to use as input to our machine learning models. 


3.3 Model Setup and Training 

As the inputs to the model consist of two binary vectors of lengths 
166 and 1024 respectively, for each metabolite, feature normaliza- 
tion is not required. 
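
The two binary input vectors can be generated with RDKit roughly as follows. This is a sketch under assumptions: the helper name and the toy molecule are illustrative, and dropping the first MACCS bit reflects the fact that RDKit returns 167 bits with bit 0 unused.

```python
from rdkit import Chem
from rdkit.Chem import MACCSkeys, rdFingerprintGenerator

# Morgan generator with radius 2 and 1024 bits (ECFP4-like).
morgan = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=1024)


def featurize(smiles):
    """Return (ecfp_bits, maccs_bits) as lists of 0/1 for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    ecfp = [int(bit) for bit in morgan.GetFingerprint(mol).ToBitString()]
    # RDKit's MACCS vector has 167 positions; position 0 is unused, so it
    # is dropped here to obtain the 166 structural keys.
    maccs_all = [int(bit) for bit in MACCSkeys.GenMACCSKeys(mol).ToBitString()]
    return ecfp, maccs_all[1:]


ecfp, maccs = featurize("CC(=O)O")  # acetic acid as a toy input
```

Since every entry is already 0 or 1, the vectors can be passed to the network without scaling.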

We then randomly split the dataset into 80% (1737 samples) 
training and 20% (434 samples) test data using a stratified split 
approach which seeks to ensure that each class is represented with 
the same ratio between the train and the test sets. This was achieved 
using the iterative_train_test_split function within the skmultilearn 
[37] Python package. 

The Keras API library [26] was used to set up two neural 
network architectures (one for each type of binary fingerprint vec- 
tor) which could predict the constituent pathway class(es) for a 
given metabolite. The number of neurons in the input and output 
layer is fixed by the dimensionality of the input features and the 
output classes, while the number of hidden layers and the number 
of neurons in each hidden layer form part of the set of 
hyperparameters for the models. As opposed to the model parameters 
(or weights) which are learned during the training process, 
hyperparameters are tunable parameters which are used to control the 
training process. They are usually established by rule of thumb or 
else through optimization procedures, which seek to determine the 
set of hyperparameter values that enable the model to achieve the 
best performance. Other neural network hyperparameters include 
the learning rate, the optimizer to be used, and the activation 
function. 


Table 2 
List of neural network hyperparameters and their ranges 

Hyperparameter                      Values 
Number of hidden layers and         (128, 64), (256, 64), (256, 128, 64), 
neurons in each layer               (512, 128), (512, 256, 128) 
Learning rate                       three values of the form 1 x 10^-k 
Activation function                 ReLU, tanh, sigmoid 
Optimizer                           Adam, stochastic gradient descent, RMSProp 


Hyperparameter optimization was carried out via grid search. 
With this technique, a scan is performed over various hyperpara- 
meters, with the best set of hyperparameters being determined via 
the per-class accuracy as explained in Subheading 3.4. A list of the 
hyperparameters and their corresponding ranges is shown in 
Table 2. 
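
The grid search described above can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's actual implementation: the learning-rate values are placeholders, and toy_score is a hypothetical stand-in for training a network and returning its validation accuracy.

```python
from itertools import product

# Search space following Table 2. The learning-rate values are
# placeholders (assumption), not the values used in the chapter.
grid = {
    "hidden_layers": [(128, 64), (256, 64), (256, 128, 64), (512, 128), (512, 256, 128)],
    "learning_rate": [1e-2, 1e-3, 1e-4],  # hypothetical values
    "activation": ["relu", "tanh", "sigmoid"],
    "optimizer": ["adam", "sgd", "rmsprop"],
}

def grid_search(grid, score_fn):
    """Evaluate every combination and return (best_config, best_score)."""
    names = list(grid)
    best_config, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

def toy_score(config):
    """Hypothetical stand-in for 'train the model, return validation accuracy'."""
    score = 0.5
    score += 0.1 * (config["activation"] == "relu")
    score += 0.1 * (config["optimizer"] == "adam")
    score += 0.01 * len(config["hidden_layers"])
    score -= abs(config["learning_rate"] - 1e-3)
    return score

# Total number of configurations the scan has to train and evaluate.
n_configs = 1
for options in grid.values():
    n_configs *= len(options)

best_config, best_score = grid_search(grid, toy_score)
```

With this grid, 135 configurations are evaluated, which illustrates why grid search becomes expensive as more hyperparameters or values are added.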


K-fold cross-validation (with K = 5) was also performed in 
conjunction with hyperparameter optimization. Validating on a single 
held-out subset means that part of the dataset is never used for 
training, which may pose a problem of underfitting. Therefore, in K-fold 
cross-validation, the training dataset is partitioned into K equally sized 
subsets (which contain samples drawn randomly). In each of K rounds, the 
model is trained on K - 1 of the subsets and evaluated on the remaining 
kth subset, so that every sample is used for both training and validation. 
The performance is then computed over the K subsets, which gives a 
more reliable estimate of how well the model generalizes. 
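
As a rough sketch of how the K folds can be built, the snippet below partitions sample indices into K = 5 subsets and yields one training/validation pair per round. It uses only the Python standard library and performs a plain (non-stratified) split, which is a simplification; the function name k_fold_indices is ours.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for K-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # samples are drawn randomly
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

# 1737 is the size of the training set quoted in the text.
splits = list(k_fold_indices(1737, k=5))
fold_sizes = [len(val_idx) for _, val_idx in splits]
```

Each sample appears in exactly one validation fold and in the training data of the other four rounds.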


Diagrams of the architectures used are shown in Figs. 5 and 6, 
showing the final neural network structures which gave the best 
performance. A learning rate of 1 x 10~* was used. The Rectified 
Linear Unit (ReLU) activation function was used throughout the 
network except for the final output layer, for which the sigmoid 
function was used. This is necessary in order to obtain a multi-label 
output. The binary cross-entropy (or log-loss) loss function was 
used, which is given by: 

$$H_p(q) = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log\big(p(y_i)\big) + (1 - y_i)\log\big(1 - p(y_i)\big) \right] \tag{14}$$

Fig. 5 Neural network architecture for the MACCS dataset, resulting in a total of 
around 60,000 weights 

Fig. 6 Neural network architecture for the Morgan dataset, resulting in a total of 
around 1.64 million weights (layer sizes legible in the figure: 1024, 512, 128, 11) 


where N is the number of samples, y_i is the true label (0 or 1) of sample i, and p(y_i) is the 
predicted probability that this label is 1. Finally, the Adam 
optimization algorithm [38] was used to update the network 
weights. This algorithm combines the advantages of two other 
optimization algorithms which in turn are extensions of stochastic 
gradient descent, namely, adaptive gradient (AdaGrad) algorithm 
and root mean squared propagation (RMSProp). 
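
To make Eq. 14 concrete, the following sketch evaluates the binary cross-entropy for a small multi-label example. Averaging over both samples and labels is an assumption on our part about how the per-label losses are aggregated, and the example probabilities are invented for illustration.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy (Eq. 14) over all samples and labels."""
    total, count = 0.0, 0
    for labels, probs in zip(y_true, y_prob):
        for y, p in zip(labels, probs):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return -total / count

# Two samples, three labels each (hypothetical values).
y_true = [[1, 0, 1], [0, 0, 1]]
uninformative = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
confident = [[0.9, 0.1, 0.8], [0.2, 0.1, 0.7]]

loss_uninformative = binary_cross_entropy(y_true, uninformative)
loss_confident = binary_cross_entropy(y_true, confident)
```

A model that outputs 0.5 for every label has a loss of log 2, while predictions that lean toward the correct labels give a lower loss.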


3.4 Model Performance 

The performance of the trained neural networks was evaluated 
using the 20% unseen testing dataset. As the predictor is a multi- 
label one, we cannot compute a global accuracy, but instead we 
obtain results on a per-class basis. Apart from the accuracy, which 
represents the fraction of correct predictions made by the model, 
the precision (the fraction of relevant instances among the retrieved 
instances) and recall (the fraction of relevant instances that were 
retrieved) are also computed: 


$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{15}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{16}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{17}$$
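
The per-class metrics of Eqs. 15-17 can be computed directly from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The sketch below uses a small invented label matrix; returning 0.0 when a denominator is zero is a convention we assume here.

```python
def per_class_metrics(y_true, y_pred):
    """Return one dict per class with accuracy, precision, and recall."""
    results = []
    for c in range(len(y_true[0])):
        pairs = [(t[c], p[c]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        tn = sum(1 for t, p in pairs if t == 0 and p == 0)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        results.append({
            "accuracy": (tp + tn) / (tp + tn + fp + fn),
            # Undefined ratios (no predicted or no actual positives) are set to 0.0.
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        })
    return results

# Six samples, two classes (hypothetical labels and predictions).
y_true = [[1, 0], [1, 0], [0, 0], [0, 1], [1, 0], [0, 0]]
y_pred = [[1, 0], [0, 0], [0, 0], [0, 0], [1, 0], [1, 0]]

metrics = per_class_metrics(y_true, y_pred)
```

In this toy example the second class has only one positive sample, which the predictor misses: its accuracy is 5/6 while its recall is zero, the same pattern of good accuracy with poor precision and recall discussed for the rarer classes.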


Figure 7 shows the per-class accuracy, precision, and recall 
obtained on the MACCS features. The model generally achieves a 
good accuracy; however, it performs less well with regard to preci- 
sion and recall, in particular for classes 2 and 7. 

An explanation of the poorer performance in terms of precision 
and recall can be found in Fig. 8, which shows how another metric, 
the F1-score, varies with the number of samples (i.e., number of 
instances when the prediction should be “1”) for each class. The 
F1-score combines both precision and recall as follows: 

$$\text{F1-score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{18}$$

Fig. 7 Per-class accuracy, precision, and recall obtained on the unseen testing set for MACCS distinct 

Fig. 8 Linear fit applied to the F1-score obtained on the unseen testing set for MACCS distinct as a function of 
the number of “1” labels per class 
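
The dependence of the F1-score on the number of positive samples can be illustrated with a short calculation. The confusion counts below are hypothetical and chosen only to mimic a rare class in a test set of 434 samples.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (Eq. 18); 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical rare class: 10 positives among 434 test samples.
tp, fp, fn, tn = 2, 3, 8, 421

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = f1_score(precision, recall)
```

Here the accuracy is about 0.97 because the many true negatives dominate, whereas the F1-score is only about 0.27, since it ignores true negatives and penalizes the missed positives.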

Informally, the Hamming loss is the fraction of labels that are 
incorrectly predicted. The Hamming loss is formally defined as: 

$$HL = \frac{1}{N L}\sum_{i=1}^{N}\sum_{j=1}^{L} X_{i,j} \oplus Y_{i,j} \tag{19}$$

where X_{i,j} is the target, Y_{i,j} is the prediction, ⊕ denotes the 
exclusive or operator which returns zero when the target and prediction 
are identical and one otherwise, L is the number of labels, and 
N is the number of samples. As this is a loss function, the optimal 
value is zero. A Hamming loss of 0.0667 was obtained for the 
MACCS features. 
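
Equation 19 translates almost literally into code, as in this minimal sketch on an invented label matrix (Python's ^ operator is the exclusive or on 0/1 integers).

```python
def hamming_loss(targets, predictions):
    """Fraction of label positions where target and prediction differ (Eq. 19)."""
    n_samples = len(targets)
    n_labels = len(targets[0])
    mismatches = sum(
        x ^ y  # exclusive or: 0 if identical, 1 otherwise
        for target_row, pred_row in zip(targets, predictions)
        for x, y in zip(target_row, pred_row)
    )
    return mismatches / (n_samples * n_labels)

# Six samples, two labels (hypothetical values).
targets = [[1, 0], [1, 0], [0, 0], [0, 1], [1, 0], [0, 0]]
predictions = [[1, 0], [0, 0], [0, 0], [0, 0], [1, 0], [1, 0]]

loss = hamming_loss(targets, predictions)
```

Three of the twelve label positions are wrong in this example, giving a Hamming loss of 0.25.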

Visual demonstrations of the model performance can be 
obtained through receiver operating characteristic (ROC) and 
precision-recall (PR) curves shown in Figs. 9 and 10, respectively. 
An ROC curve is obtained by plotting the true positive rate 
obtained as a function of the false-positive rate, as the classifier’s 
discrimination threshold is varied. A good model obtains a high 
true positive rate for a corresponding low false-positive rate, resulting 
in graphs which go through the top left-hand corner of the plot. 
A random classifier would result in the dotted red line with the true 
positive rate equal to the false-positive rate. The area under the 
curve (AUC) is another metric that can be obtained from ROC 
curves, with a perfect classifier having an AUC of 1. 
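
As an illustration of how an ROC curve and its AUC are obtained for a single class, the sketch below sweeps the decision threshold over the predicted scores and integrates with the trapezoidal rule. The scores are invented, and the helper names are ours rather than those of any particular library.

```python
def roc_curve_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by sweeping the decision threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Visit samples from highest to lowest score; each distinct score is a threshold.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score (handles ties).
        if i == len(ranked) - 1 or ranked[i + 1][0] != score:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr

def auc(x, y):
    """Area under a curve by the trapezoidal rule."""
    return sum((x[i + 1] - x[i]) * (y[i + 1] + y[i]) / 2 for i in range(len(x) - 1))

# Hypothetical labels and predicted probabilities for one class.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

fpr, tpr = roc_curve_points(y_true, scores)
roc_auc = auc(fpr, tpr)
```

In this example one negative sample is ranked above one positive sample, so the AUC is 8/9 rather than 1.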
Fig. 9 Per-class receiver operating characteristic curves showing the area under the curve (AUC) obtained 

Fig. 10 Per-class precision recall curves showing the area under the curve (AUC) obtained 


On the other hand, a PR curve is generated by plotting the 
precision as a function of the recall. In this case, the ideal curve 
passes through the top right-hand corner. A random (“no skill”) classifier 
would result in a horizontal line whose precision equals the fraction of 
positive samples in the class, regardless of the recall. 

Figures 11 and 12 show the per-class accuracy, precision, and 
recall obtained for the Morgan features, as well as the F1-score 
varying with the number of samples. A Hamming loss of 0.0766 was 
obtained. The per-class ROC and PR curves are shown in Figs. 13 
and 14, respectively. 
Fig. 11 Per-class accuracy, precision, and recall obtained on the unseen testing set for Morgan distinct 

Fig. 12 Linear fit applied to the F1-score obtained on the unseen testing set for Morgan distinct as a function of 
the number of “1” labels per class 

Fig. 13 Per-class receiver operating characteristic curves showing the area under the curve (AUC) obtained. 
Legend entries legible in the source: Class 9, AUC = 0.96; Class 10, AUC = 0.91; Class 11, AUC = 0.95 

Fig. 14 Per-class precision recall curves showing the area under the curve (AUC) obtained. Legend: No skill; 
Class 1, AUC = 0.47; Class 2, AUC = 0.33; Class 3, AUC = 0.72; Class 4, AUC = 0.50; Class 5, AUC = 0.60; 
Class 6, AUC = 0.20; Class 7, AUC = 0.33; Class 8, AUC = 0.77; Class 9, AUC = 0.89; Class 10, AUC = 0.81; 
Class 11, AUC = 0.89 


4 Conclusion 


Understanding the mechanisms and structural mappings between 
molecules and pathway classes is an important step toward design of 
reaction predictors for synthesizing new molecules. In this chapter, 
we provided an in-depth look at machine learning and neural net- 
works and demonstrated how these techniques can be applied to 
the problem of predicting the association of metabolites to their 
respective metabolomic pathway classes. The dataset preparation 
procedure, including standardization, clustering, and descriptor 
generation, is presented, and we also discuss a number of metrics 
and methods which can be used to evaluate the performance of the 
model. The Python code for this chapter is made available at 
https://github.com/jp-um/machine_learning_for_metabolomic_pathway_analyses. 
Machine Learning and Hybrid Methods for Metabolic Pathway Modeling 

Abstract 


Computational cell metabolism models seek to provide metabolic explanations of cell behavior under 
different conditions or following genetic alterations, help in the optimization of in vitro cell growth 
environments, or predict cellular behavior in vivo and in vitro. In the extremes, mechanistic models can 
include highly detailed descriptions of a small number of metabolic reactions or an approximate represen- 
tation of an entire metabolic network. To date, all mechanistic models have required details of individual 
metabolic reactions, either kinetic parameters or metabolic flux, as well as information about extracellular 
and intracellular metabolite concentrations. Despite the extensive efforts and the increasing availability of 
high-quality data, required in vivo data are not available for the majority of known metabolic reactions; 
thus, mechanistic models are based primarily on ex vivo kinetic measurements and limited flux information. 
Machine learning approaches provide an alternative for derivation of functional dependencies from existing 
data. The increasing availability of metabolomic and lipidomic data, with growing feature coverage as well as 
sample set size, is expected to provide new data options needed for derivation of machine learning models of 
cell metabolic processes. Moreover, machine learning analysis of longitudinal data can lead to predictive 
models of cell behaviors over time. Conversely, machine learning models trained on steady-state data can 
provide descriptive models for the comparison of metabolic states in different environments or disease 
conditions. Additionally, inclusion of metabolic network knowledge in these analyses can further help in the 
development of models with limited data. 

This chapter will explore the application of machine learning to the modeling of cell metabolism. We first 
provide a theoretical explanation of several machine learning and hybrid mechanistic machine learning 
methods currently being explored to model metabolism. Next, we introduce several avenues for improving 
these models with machine learning. Finally, we provide protocols for specific examples of the utilization of 
machine learning in the development of predictive cell metabolism models using metabolomic data. We 
describe data preprocessing, approaches for training of machine learning models for both descriptive and 
predictive models, and the utilization of these models in synthetic and systems biology. Detailed protocols 
provide a list of software tools and libraries used for these applications, step-by-step modeling protocols, 
troubleshooting, as well as an overview of existing limitations to these approaches. 


Key words Metabolism modeling, Hybrid modeling, Metabolomics, Lipidomics, Flux analysis, 
Machine learning 
1 Introduction 


Whether for production of biologics or bioremediation in meta- 
bolic engineering, understanding different metabolic states under 
physiological and disease conditions to identify new therapeutic 
targets or for predictive modeling of cell behavior in a changing 
environment, computer modeling of cell metabolism provides an in 
silico platform to test optimal culture conditions, intervention, or 
impact of target engagement. Such models have been used to 
advantage in multiple biopharmaceutical applications [1], drug 
target identification [2], toxicogenomics including comparison 
of animal and human cell responses [3], and detailed kinetic 
models of simple cell systems, including red blood cells (erythrocytes) 
[4] and platelets [5]. These models can be further expanded 
into major biotechnology platforms designed to optimize the engi- 
neering of CHO cells for biologics [6] and HEK293 cells for 
vaccine particle production [7] and characterize the metabolic 
changes that influence pluripotency and stem cell fate [8]. 

Classical, mechanistic, cell metabolism models, generally, are 
either dynamic models that include detailed kinetic information for 
a limited number of reactions or steady-state, constrained models 
that simulate stationary behavior of a larger cellular, tissue, or 
organismal system [9]. These models are built based on biological 
knowledge and only for known metabolic reactions where subsets 
of reaction or flux parameters are optimized using data to fit specific 
conditions. Kinetic models allow dynamic simulation of the change 
in the system over time; constrained models assume the system is in 
steady-state, thus, only allowing simulation of the flux through 
reactions with the assumption of constant metabolite concentra- 
tions on the simulation timescale. When choosing between these 
extremes, the modeler is faced with a trade-off between the size of 
the model and the level of detail provided by the predicted 
solutions. 

Different combinations of methods have been proposed to 
model metabolism including efforts to develop a genome-scale 
kinetic model combining large network coverage with detailed 
reaction and metabolite concentration analysis (reviewed in 
[10, 11]). Bringing together different types of mechanistic models 
is an attempt to alleviate the deficits of constraint-based models, 
which lack information about dynamic metabolite concentrations 
and enzyme regulation, while optimizing the kinetic 
framework to reduce shortcomings associated with nonlinearity, 
parameter identifiability, and uncertainty. Although these combined 
approaches can bring metabolism modeling closer to the 
optimal large scale, they fully depend on a priori biological knowledge. 
Moreover, in practice they will encompass multiple 
unknown parameters that require optimization or testing for 
Fig. 1 Examples of possible contributions of ML in different steps of mechanistic model development. Although 
ML can be used for building of the complete model or for combination of model and experimental data, it can 
also help in determination of parameters and optimization of specific steps in development and utilization of 
mechanistic models. Panel labels legible in the source include “Theory inspired ML models,” “define kinetic 
rate equations,” “NLP literature analysis,” and “ML functional analysis from data.” NLP natural language 
processing, ODE ordinary differential equations 


outcome across many combinations and large ranges of values. 
Such hybrid models are, by their very nature, both computationally 
demanding and data-intensive. The application of machine learning 
(ML) methods to these models can address some of these issues. 
ML enables various types of data to be used simultaneously and 
offers data-driven approaches that can provide more efficient 
parameter searches and more accurate, unbiased 
modeling. Thus, ML can both contribute to specific 
steps in the mechanistic model development, as outlined in Fig. 1, 
and present new global approaches for the expansion of hybrid 
methods combining both constrained and kinetic modeling 
described below. 

ML combines sets of algorithms that develop predictive models 
through experience, i.e., through learning and functional generali- 
zation from data. ML models can be developed only from the data 
and do not require any prior knowledge; however, they also benefit 
from inclusion of domain knowledge that can optimize ML meth- 
odology for specific applications. In this way, prior knowledge can 
reduce training data needs. ML methods can also contribute to 
individual steps in the development of a mechanistic model (Fig. 1 
shows some possible applications). In this context, ML is not used 
in modeling but helps to gather information, optimize parameters, 
or provide better solvers for differential equations [12]. Alterna- 
tively, ML can be further integrated into mechanistic models to 
provide analysis of the results or include theoretical information for 
development of knowledge-inspired metabolic ML models. Therefore, 
combining ML and mechanistic methods into “hybrid” cell 
metabolism models can augment the mechanistic knowledge 
about metabolism and the kinetics of metabolic reactions with 
data-driven methods for describing unknown parts of the system 
or for describing, more effectively, the underlying complexity of the 
system. These hybrid models are a very recent innovation, with 
great potential to provide new insight into metabolism and its influence 
on organism and cellular fate. 

1.1 From Mechanistic to ML Models, There, and Back Again 

The first hybrid model in systems biology was presented in 2010 
[13], yet this potentially transformative approach remains in its 
infancy due to the complexity of the problem and the lack of 
appropriate data for most applications. In the most general case, 
mathematical modeling attempts to combine both internal and 
external metabolic reactions and interactions, with the ultimate goal of 
simulating the complete metabolic network in all its 
detail, including complete metabolic pathways and individual reactions 
as well as activation and inhibition, in a formal, numerical 
representation that provides as high a level of accuracy and detail as 
is achievable with our current level of information. For well-described 
systems and reactions, it is possible to develop highly 
accurate, mechanistic models, presenting detailed dynamic reaction 
information and providing the change in metabolism over time via 
differential equations allowing inclusion of the effects of inhibitors 
or activators of enzyme functions. The increasing availability of 
longitudinal omics data will allow optimization of kinetic para- 
meters in these models. However, for the majority of reactions, 
this level of information is not available, and modeling is only 
possible using approximations of kinetic process simulation (e.g., 
Michaelis-Menten equation) or by reducing studies by assuming 
steady state and constraining potential responses, thereby making it 
possible to model a larger number of reactions. Hybrid models 
employing ML have been fueled by the increasing availability of 
large amounts of biomolecular data. ML models increase calcula- 
tion speed, but, even more importantly, ML can assist in creating 
models for systems for which there is limited knowledge via a data- 
driven approach. ML methods can furthermore be used to combine 
data from different sources including multiomic data, enhance 
mechanistic models by providing additional in silico data, and 
optimize methods for parameter determination. ML methods can 
help in building and executing simulations to test outcome. 
Kinetic models of metabolism integrate enzyme regulation and 
multiomic data with reaction network information to provide 
dynamic analyses and predictions of metabolite concentrations. 
These models present mechanistic representation of the processes 
in cells defined as a series of ordinary differential equations (ODEs) 
and include details of rate expression and kinetic parameters to 
estimate dynamic behavior of each reaction in the model. The 
mathematical form of the model is shown in Eq. 1: 

$$\frac{dc_i}{dt} = S\,V(E, C, \varepsilon), \quad i = 1, 2, \ldots, n \tag{1}$$

where c_i is the concentration of metabolite i, S is the stoichiometric 
matrix, and V is the vector representation of reaction flux that 
depends on E (the enzyme abundance), C (metabolite concentrations), 
and ε (kinetic parameters for the reaction). Equations are 
written for each metabolite in the system, requiring knowledge of 
appropriate parameters for each and every reaction. The sets of 
ODEs are then solved, often using approximate rate laws such as 
the Michaelis-Menten or Hill kinetic equations 
[14]. As a result, the majority of kinetic models focus on a 
small subset of reactions within specific pathways. Kinetic models 
have been developed for a number of metabolic pathways in differ- 
ent organisms and are made available through dedicated reposi- 
tories (listed in Table 1). While useful and effective, the possibility 
to develop large, genome-scale kinetic models remains challenging 
given issues of kinetic model nonlinearity, computational tractabil- 
ity, parameter identifiability, estimability, and uncertainty [10]. 
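
To show how Eq. 1 is used in practice, the following sketch simulates a hypothetical two-step pathway A → B → C with Michaelis-Menten rate laws and a simple forward Euler integrator. All parameter values, the assumption that the maximal rate equals kcat times enzyme abundance, and the choice of integrator are illustrative; dedicated tools such as those listed in Table 1 use more robust ODE solvers.

```python
# Stoichiometric matrix S: rows are metabolites (A, B, C),
# columns are reactions (r1: A -> B, r2: B -> C).
S = [
    [-1, 0],
    [1, -1],
    [0, 1],
]

def michaelis_menten(vmax, km, substrate):
    """Michaelis-Menten rate law."""
    return vmax * substrate / (km + substrate)

def fluxes(c, enzymes, params):
    """Flux vector V(E, C, eps); vmax is assumed to be kcat * enzyme abundance."""
    a, b, _ = c
    v1 = michaelis_menten(params["kcat1"] * enzymes[0], params["km1"], a)
    v2 = michaelis_menten(params["kcat2"] * enzymes[1], params["km2"], b)
    return [v1, v2]

def simulate(c0, enzymes, params, dt=0.01, steps=2000):
    """Integrate dc/dt = S V with forward Euler and return the trajectory."""
    c = list(c0)
    trajectory = [tuple(c)]
    for _ in range(steps):
        v = fluxes(c, enzymes, params)
        dcdt = [sum(S[i][j] * v[j] for j in range(len(v))) for i in range(len(c))]
        c = [ci + dt * d for ci, d in zip(c, dcdt)]
        trajectory.append(tuple(c))
    return trajectory

# Hypothetical parameter values in arbitrary units.
params = {"kcat1": 1.0, "km1": 0.5, "kcat2": 0.5, "km2": 0.5}
trajectory = simulate(c0=(1.0, 0.0, 0.0), enzymes=(1.0, 1.0), params=params)
final_a, final_b, final_c = trajectory[-1]
```

Even this toy model needs four kinetic parameters and two enzyme abundances for two reactions, which hints at why parameterizing genome-scale kinetic models is difficult.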

While kinetic information is available for a number of enzymes 
in several detailed databases [15, 16] (reviewed in Table 1), the 
majority of kinetic constants have been measured ex vivo. Without 
empirical validation, it is possible that they inadequately represent 
the in vivo situation. More accurate determination of kinetic para- 
meters requires optimization from data; however, models generally 
have problems in identifying and optimizing large numbers of 
parameters given nonlinear mechanistic rate equations. Simplified 
kinetic models have been explored for different applications by 
either reducing the size of the pathway space or simplifying kinetic 
equations. Such approaches require optimization of these approxi- 
mate parameters for each case. Improvements in the optimization 
and fitting of models to data have been proposed with methods 
such as approximate Bayesian computation (ABC) [17] presented 
as a way to improve fitting strategy by sampling values from an 
approximation of the posterior distribution while not calculating 
explicitly the likelihood function. 
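
The idea behind ABC can be illustrated with its simplest variant, rejection sampling: parameters are drawn from a prior, the model is simulated, and draws are kept only if the simulated output lies close to the data. The sketch below applies this to a hypothetical first-order decay model with one rate constant; the model, prior, distance, and tolerance are all assumptions for illustration, and the approach in [17] may differ in its details.

```python
import math
import random

def simulate_decay(k, times, c0=1.0):
    """Toy mechanistic model (assumption): c(t) = c0 * exp(-k * t)."""
    return [c0 * math.exp(-k * t) for t in times]

def distance(a, b):
    """Euclidean distance between simulated and observed time courses."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def abc_rejection(observed, times, n_draws=5000, tolerance=0.05, seed=1):
    """Return prior draws whose simulation lies within `tolerance` of the data."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        k = rng.uniform(0.0, 2.0)  # draw from a uniform prior on k
        if distance(simulate_decay(k, times), observed) < tolerance:
            accepted.append(k)     # no likelihood is ever evaluated
    return accepted

times = [0.0, 0.5, 1.0, 2.0, 4.0]
observed = simulate_decay(0.7, times)  # synthetic "data" generated with k = 0.7
posterior_sample = abc_rejection(observed, times)
posterior_mean = sum(posterior_sample) / len(posterior_sample)
```

The accepted values approximate the posterior distribution of k; tightening the tolerance improves the approximation at the cost of rejecting more draws.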

The alternative to kinetic models, constraint-based modeling, 
lacks the representation of metabolite concentration and enzyme 
regulation afforded by kinetic models. Instead, these so-called 
genome-scale metabolic models (GEMs) combine gene sequence 
information with omics data to provide a map of intracellular 
metabolism for an organism through calculation of the stoichio- 
metric matrix. GEMs have been used for a number of different 
applications, for example, flux balance analysis (FBA) [18] or metabolic 
balance analysis (MBA) [19] as well as testing of synthetic 
lethality of genes [20] and determination of off-target drug effects 
[21]. GEMs are built on a network connection of all metabolic 
reactions that are known to occur in an organism combining metabolites, 
genes, and protein information to inform observed changes 
in metabolite concentrations across conditions. 

Table 1 
Examples of resources available for model development, ML examples, as well as metabolic models 

Metabolism model development | Software application | Examples of applications in cell culture metabolomics 
Bayesian modeling | GRASP [17] | Methionine cycle modeling using approximate Bayesian computation [17] 
Logical modeling | CellNetOptimizer (http://www.cellnopt.org); GINsim (http://ginsim.org) | Combination of cell line proteomics and metabolomics data logic mechanistic modeling to explain heterogeneous drug response in cellular cholesterol regulation [62] 
Dynamic modeling through ordinary differential equations | COPASI [63]; CellDesigner [64]; VCell [65] | Many examples of COPASI's use in biotechnology cell modeling are reviewed in [66]; recent example of hybrid cybernetic modeling that combines dynamic modeling between different metabolic states for CHO cells [67] 
Stochastic modeling | COPASI [63]; StochKit [68]; MaBoSS (http://maboss.curie.fr) | Theoretical foundation to study metabolism in conjunction with stochastic enzyme expression has been presented showing metabolic heterogeneity resulting from enzyme-level stochasticity [69] 
Stoichiometric modeling | COBRA [57]; CobraPy [57, 70]; Raven 2.0 [58]; Merlin [71] | Genome-scale stoichiometric reconstructions and computational models of mammalian metabolism particularly for CHO cells coupled to protein secretion [72] 
Agent-based modeling | ARCADE [73] | Extensive review of agent-based methods for cancer cell modeling [37] 

ML tools | Software application | Examples of some application in cell culture metabolomics 
Longitudinal GPR (LonGP) | https://github.com/chengl7/LonGP [40] | Additive GPR method for non-parametric analysis of longitudinal data 
LSTM used in metabolism modeling | https://github.com/youlab/pattern_prediction_NN_Shangying [37] | LSTM for improvement of parameter modeling based on mechanistic models 

Metabolism model database | Software application | Type of resource 
BioModels | https://www.ebi.ac.uk/biomodels [74] | Model repository 
SABIO-RK | http://sabio.h-its.org/ [75] | Kinetic information 
BRENDA | https://www.brenda-enzymes.org/ [16] | Kinetic information 
eQuilibrator | https://equilibrator.weizmann.ac.il/ [76] | Database of biochemical equilibrium constants and Gibbs free energies 

The potential to determine and work from the entire metabolic 
reaction network derived directly from genome information opens 
an opportunity for building complete metabolic maps for any 
organism as well as subsets of metabolic networks for different 
biological systems. GEMs can simulate flux for all known metabo- 
lites. Additionally, they can provide a platform for multiomic anal- 
ysis as well as a system for an evaluation of the complete 
metabolome space with sparse metabolomic profiling data. How- 
ever, their reaction maps are often underdetermined, with more 
reactions than metabolites; thus, they generate many possible solu- 
tions often too complex for the majority of applications [1]. A 
number of approaches to address this issue and simplify these 
models for specific applications include the utilization of transcrip- 
tomic, proteomic, and metabolomic data to remove unlikely reac- 
tions as well as the addition of biological, physical, or chemical 
constraints [22–24]. Gene expression data is commonly used to 
extract the subset of reactions that are active in a specific situation 
and silence reactions catalyzed by enzymes that are not expressed. 
Although this approach is efficient, it makes the strong assumption 
that gene expression activity measured at a given time point in a 
mixture of cells is linked to the gene-protein-reaction network at steady 
state. This assumption is an oversimplification of the highly complex 
relationship between proteins, metabolite fluxes, and gene 
expression. As an example, the most complete GEM for metabolism 
of human cells, Recon3D, provides a network of 10,600 
reactions linking 5835 metabolites and 2248 genes [25]. Recon3D 
provides very good coverage of hydrophilic metabolites; however, 
while it includes a number of lipid pathways, its coverage of the 
lipidome is incomplete, making it difficult to extend 
beyond metabolomics. 
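
The underdetermination of constraint-based models, and the effect of silencing non-expressed reactions, can be seen in a deliberately tiny example. The network, flux values, and bounds below are hypothetical; real GEM analyses rely on linear programming tools such as those listed in Table 1 rather than on enumerating candidate flux vectors.

```python
# One internal metabolite M and three reactions:
#   r1: uptake -> M,   r2: M -> product 1,   r3: M -> product 2
S = [[1, -1, -1]]  # 1 metabolite x 3 reactions

def is_steady_state(S, v, tol=1e-9):
    """Check S v = 0, i.e., no net accumulation of any metabolite."""
    return all(abs(sum(s * x for s, x in zip(row, v))) < tol for row in S)

def silence(bounds, expressed):
    """Set the upper flux bound to zero for reactions whose enzyme is not expressed."""
    return [(lo, hi if on else 0.0) for (lo, hi), on in zip(bounds, expressed)]

def within(v, bounds):
    """Check that every flux respects its (lower, upper) bound."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(v, bounds))

# Several flux distributions satisfy the same steady-state constraint.
candidates = [[10.0, 10.0, 0.0], [10.0, 3.0, 7.0], [10.0, 0.0, 10.0]]
feasible = [v for v in candidates if is_steady_state(S, v)]

# Uptake fixed at 10; expression data (hypothetical) indicate r3 is inactive.
bounds = [(10.0, 10.0), (0.0, 10.0), (0.0, 10.0)]
bounds_r3_off = silence(bounds, expressed=[True, True, False])
feasible_r3_off = [v for v in feasible if within(v, bounds_r3_off)]
```

With three reactions and one metabolite, all three candidates are valid steady states; once r3 is silenced, only the distribution routing all flux through r2 remains.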

The lack of network solutions for lipidomic data makes lipidomics 
highly amenable to data-driven modeling. Development of 
mechanistic lipid metabolism kinetic models or a complete representation 
of lipid processes via GEMs remains highly challenging 
due to the diversity of lipid functions and their enzymes. As classified 
by the LIPID MAPS consortium [26], lipids are divided into 
eight categories and further subdivided into multiple classes, subclasses, 
divisions, and molecular species, each with specific roles and 
synthesized or remodeled by overlapping enzymatic pathways. Current 
estimates of the number of lipid species in biological life range 
from 9000 to 100,000 [27]. This diversity in lipid structures and 
functions makes the mapping of all interconnections of lipids 
impossible as of today. In addition, the enzymes which regulate 
lipids are promiscuous, catalyzing several different reactions with 
different specificities for the hydrocarbon chains that define lipid 
identities [28]. Without detailed substrate affinities, it is difficult to 
predict which lipids at the molecular level will be impacted by a 
change in condition or state. As a further challenge to all metabolomic 
modeling, cellular reactions are compartmentalized, with 
enzymes localizing to specific organelles within cells and to specific 
tissues within an organism. Thus, modeling must consider not only 
lipid abundances and enzymatic function but also their transport 
and, ideally, their subcellular concentrations. As an example, acid 
ceramidase encoded by ASAH1 localizes to the lysosome and catalyzes 
the hydrolysis of ceramides to their constituent sphingoid 
base and free fatty acid at pH = 4.5. If the enzyme is mislocalized or 
lysosomal pH is alkalinized, then acid ceramidase catalyzes the 
reverse reaction, increasing the abundance of ceramides from a 
sphingoid base and a free fatty acid [29, 30]. Under physiological 
conditions, acid ceramidase displays substrate preference for ceramides 
and free fatty acids with unsaturated N-acyl hydrocarbon 
chains of 6-16 carbons [29]. 

1.2 Improving Cell Metabolism Modeling with ML 

ML methods can be viewed as a combination of algorithms that 
learn and generalize functional dependencies from experience, i.e., from 
data, to identify high-order correlations and then generate predictions. 
At the most basic level, ML methods can be 
divided into two approaches: unsupervised and supervised. Unsupervised 
methods aim to determine variation, correlations, groups, 
or functional dependencies among samples without any input of 
sample labels from an external “supervisor” [31]. Supervised methods, 
on the other hand, rely on the inputted sample labels and try to 
develop models that predict targets and underlie the supervised 
group classification. Regression analysis is part of supervised ML, 
where algorithms are trained with input and output features to 
provide predictive modeling for a continuous outcome (e.g., metabolite 
concentration over time) based on the value of one or more 
predictors, such as an input value, system parameter, or condition 
characteristic. 
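
As a minimal example of supervised regression, the sketch below fits a straight line to hypothetical time-course measurements of a metabolite concentration and uses it to predict a later time point. Ordinary least squares is used purely for illustration; the methods discussed in this chapter are typically more flexible.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical measurements: time (h) and metabolite concentration (mM).
time_h = [0.0, 1.0, 2.0, 3.0, 4.0]
conc_mM = [5.0, 4.1, 2.9, 2.1, 1.0]

slope, intercept = fit_line(time_h, conc_mM)
predicted_at_5h = slope * 5.0 + intercept
```

Here time is the input feature and concentration the continuous output; the fitted slope of about -1 mM per hour is the learned functional dependency.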

Specific roles of ML in combination with mechanistic metabo- 
lism modeling are: 


1. Integration of in silico mechanistic modeling results with other 
omics data. 

2. Determination of parameters for mechanistic models from 
data- or theory-driven ML. 


We review example methods that have been applied with suc- 
cess below and then provide specific methodology protocols. 


1.2.1 Integration of in Silico Mechanistic Modeling Results with Other Omics Data 

To achieve integration, the user must first develop and optimize a 
mechanistic model and then use the data obtained from this model 
for ML analysis of the system. ML system exploration can use the 
results of the simulation or combine simulation outputs with other 
relevant data about the system under investigation. As a proof of 
principle, a combination of ML and multiomics data was used to 
effectively predict pathway dynamics in [32, 33]. In this approach, 
metabolism models can be built at any scale, from whole-network 
GEMs to very small models, including successful recapitulation 
of lipid metabolism (reviewed in [34]). Here, ML is subsequently 
used as a tool for data mining rather than modeling. A 
small number of examples, combining GEM and ML methods, 
have shown potential for utilization of both supervised and unsupervised 
ML for this type of application. As an example, when used 
for analysis of the effect of inhibitors on metabolism, GEMs can 
provide simulation of flux differences following disruption of a 
specific metabolic step. In this approach, ML can be used to determine 
major changes across the network between control and in 
silico “treated” cases. Shaked et al. [35] have used support vector 
machine (SVM) and random forest (RF) ML methods to determine 
major metabolic alterations from simulated flux data obtained 
using flux variability analysis (FVA) following inhibitory drug simulation 
through gene deletion analysis. In this way, ML was used to 
determine drug side effects on the metabolic network [35]. In 
another very significant application, GEM and ML models were 
combined during learning tasks by embedding stoichiometric constraints 
in the ML model training process [36]. In this approach, 
dynamic elementary mode regression discriminant analysis was 
developed to identify the most discriminant pathway activation 
patterns between different conditions [36]. 

1.2.2 Determination of Parameters for Mechanistic Models from Data- or Theory-Driven ML 

Mechanistic models require optimization of parameters from data 
where, in the majority of cases, models cannot be solved analyti- 
cally; thus, parameter optimization requires numerical methods. 
These methods are often slow and, for a large number of para- 
meters with exponentially increasing number of combinations, 
unable to perform large-scale explorations of the complete param- 
eter space. Yet the complete parameter space must be interrogated 
in order to determine global, optimal parameter or input choices. 
The long short-term memory (LSTM) deep learning-based network
analysis method has shown promising results for accelerating this
parameter optimization with high accuracy [37]. LSTM was
introduced as a way to resolve the problems of exploding/vanishing
gradients that recurrent or very deep neural networks face when
trying to learn long-term dependencies [38]. LSTM was developed
for processing continuous series of data [39], including time
course sequences (as is usually the case in mechanistic models) or
series of outcomes for combinations of input parameters (as needed
for optimization of the model routine). The strength of these deep
learning methods lies in their capacity to establish a map of
outcomes from the training data. In the LSTM application, a small
subset of data generated using mechanistic models is used to train
a neural network that then provides faster coverage of the
parameter space to determine the optimal combination for a given
system. 
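
The workflow described above (a small set of mechanistic runs, a fast learned map, then a wide search) can be sketched in a few lines. This is only an illustration under stated assumptions: the mechanistic model is a toy Michaelis-Menten substrate depletion integrated with Euler steps, the parameter bounds and target value are invented, and a k-nearest-neighbour average stands in for the LSTM of [37] so that the example runs with the Python standard library alone.

```python
import random

S0 = 10.0  # initial substrate level (arbitrary units, illustrative)
BOUNDS = {"vmax": (0.1, 3.0), "km": (0.5, 5.0)}  # hypothetical allowed ranges


def simulate(vmax, km, s0=S0, steps=500, dt=0.01):
    """Toy mechanistic model: substrate left after Michaelis-Menten consumption."""
    s = s0
    for _ in range(steps):
        s -= dt * vmax * s / (km + s)  # explicit Euler step of ds/dt = -V(s)
    return s


def sample(rng):
    """Draw one random parameter set inside the allowed ranges."""
    return tuple(rng.uniform(*BOUNDS[name]) for name in ("vmax", "km"))


def knn_predict(train, query, k=3):
    """Stand-in surrogate (not an LSTM): mean output of the k nearest training sets."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query))
    )[:k]
    return sum(out for _, out in nearest) / k


rng = random.Random(0)
# 1. Small training set generated with the (slow) mechanistic model.
train = [(p, simulate(*p)) for p in (sample(rng) for _ in range(50))]
# 2. Cheap surrogate screening of a much larger candidate set.
target = 4.0  # desired substrate level at the end of the run (illustrative)
candidates = [sample(rng) for _ in range(5000)]
best = min(candidates, key=lambda p: abs(knn_predict(train, p) - target))
# 3. Confirm the selected candidate with a single mechanistic run.
confirmed = simulate(*best)
print(best, confirmed)
```

Only 51 mechanistic simulations are run here, while 5000 parameter combinations are screened by the surrogate; the final mechanistic run is kept as a check because the surrogate is only an approximation.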

A very detailed outline of LSTM methodology, with examples of LSTM
architectures used for metabolism modeling, is provided in
[37, 38]. In an LSTM block, the cell remembers (i.e., holds)
values over some time or point intervals, and the gates control
and regulate the flow of information into the cell. LSTM is
ultimately built from a set of recurrently connected subnetworks
(blocks) where each block maintains its state and regulates
information flow through its nonlinear gating units. In the
applications reviewed in [37, 38], LSTM is used to determine
mechanistic model input parameters because it can search through a
larger space of parameter options using a relatively small training
set of random parameters and the molecular outputs predicted for
them by the mechanistic model. LSTM networks were shown to provide
reliable and, most importantly, novel patterns of parameters,
suggesting that they are not limited to passive repetition of the
training information but provide a real mapping between input and
output parameters. In this approach, neural network model building
focuses on an empirical mapping of combinations of input parameters
to system outputs of interest and provides a much faster way to
search the input parameter space while, at the same time, providing
very accurate models for output parameters. For exploration,
vanilla LSTM is readily available in Python
or MATLAB applications. 
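
To make the roles of the cell and the gates concrete, a single step of a scalar LSTM block can be written out directly. This is a minimal sketch of the standard LSTM update and not the architecture of [37, 38]; the weights are hand-picked, hypothetical values chosen so that the forget gate is almost fully open and the input gate almost closed, which lets the cell hold its stored value.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM block; w maps gate name -> (w_x, w_h, bias)."""

    def gate(name, squash):
        w_x, w_h, b = w[name]
        return squash(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)       # how much of the old cell state is kept
    i = gate("input", sigmoid)        # how much new information enters the cell
    o = gate("output", sigmoid)       # how much of the cell state is exposed
    g = gate("candidate", math.tanh)  # candidate value to be written
    c = f * c_prev + i * g            # updated cell state (the "memory")
    h = o * math.tanh(c)              # block output passed to the next step
    return h, c


# Hypothetical weights: forget gate near 1, input gate near 0.
weights = {
    "forget": (0.0, 0.0, 6.0),
    "input": (0.0, 0.0, -6.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = 0.0, 0.8
for x in [0.5, -0.3, 0.9]:
    h, c = lstm_step(x, h, c, weights)
print(h, c)
```

With these settings the cell state stays close to its initial value of 0.8 across the three inputs, which is the "remembering" behavior described above; in a trained network the gate weights are learned rather than set by hand.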

An alternative approach to training ML models with data and 
mechanistic models is to use biological knowledge to develop more 
appropriate ML models that can then be trained with smaller 
datasets providing knowledge-constrained modeling. Gaussian 
process regression (GPR) is a method of great interest in this type 
of application. In GPR, analysis and modeling of time-series data 
and the determination of parameters and models can be viewed as a 
regression problem where the goal of inference is to determine the 
putative form of the time-dependent function and to obtain the 
probability distribution of the dependent value on the variable. In 
the sense of metabolism modeling, regression problems would take 
the form of $c(t) = f(\theta(t)) + \varepsilon$. This functional dependence
determination can be viewed as curve fitting that assumes that $c_i$ is
ordered by $t_i$, where $c_i$ is a function of time. GPR models can
provide nonlinear system modeling, can be trained with smaller 
datasets, and can automatically output values that include the vari- 
ance and confidence interval of the model. In addition, prior 
knowledge can be incorporated into the GPR model before training
through optimization of covariance and kernel function. Here,
kernels can be viewed as flexible nonlinear functions that can be
optimized and developed to define how quickly the regression
function will vary. A related example of utilization of GPR in
modeling of longitudinal processes was recently presented in [40].

Fig. 2 Brief outline of two approaches linking mechanistic and machine learning (ML) models for (a) using ML
for combined analysis of simulation results and omics data and (b) using ML for increased parameter space
search coverage in order to increase
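
A compact sketch may help show how a GPR model returns both a prediction and its variance for a concentration time course. The code below assumes a zero-mean prior with a squared-exponential (RBF) kernel and uses the standard Cholesky-based predictive equations; the function names, the five synthetic time points, and the kernel settings are illustrative assumptions and are not taken from [40].

```python
import math


def rbf(a, b, length=1.0, var=1.0):
    """Squared-exponential kernel; `length` sets how quickly the function may vary."""
    return var * math.exp(-0.5 * ((a - b) / length) ** 2)


def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L


def solve_lower(L, b):
    """Forward substitution for L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def solve_upper_t(L, y):
    """Back substitution for L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gpr_predict(t_train, c_train, t_new, length=1.0, var=1.0, noise=1e-4):
    """Posterior mean and variance of the latent function at t_new."""
    n = len(t_train)
    K = [[rbf(t_train[i], t_train[j], length, var) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, c_train))
    k_star = [rbf(t_new, t, length, var) for t in t_train]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    variance = rbf(t_new, t_new, length, var) - sum(x * x for x in v)
    return mean, max(variance, 0.0)


# Synthetic time course: exponential decay sampled at five time points.
t_obs = [0.0, 1.0, 2.0, 3.0, 4.0]
c_obs = [math.exp(-0.5 * t) for t in t_obs]
mean_in, var_in = gpr_predict(t_obs, c_obs, 2.0)
mean_far, var_far = gpr_predict(t_obs, c_obs, 10.0)
print(mean_in, var_in, mean_far, var_far)
```

Near the measured time points the predictive variance is small, while far from the data it returns to the prior variance. Prior knowledge enters through the choice of kernel and its length scale, which in practice would be selected or optimized rather than fixed as here.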

Although many different ML approaches can be combined 
with mechanistic modeling in a variety of ways and for a range of 
applications, a number of similar procedural steps are required for 
application of any ML method in either analysis of model-derived 
data or augmentation of mechanistic models. The Materials section
below lists software tools and links to major metabolism modeling
databases. The Methods section provides detailed protocols for
utilization of LSTM and GPR in modeling, with similar protocols
required for other ML models, and Fig. 2 gives a schematic
presentation of these procedures.

2 Materials

Information about Web resources providing data, information, and
software for metabolic modeling that can support ML and hybrid
model development is presented in Table 1.

3 Methods

3.1 Using Mechanistic Models to Produce Data for Incorporation into ML Classifiers

Development of a high-quality model relies upon (1) the intimate
knowledge of the system in question, (2) the articulation of
appropriate hypotheses to test the models using experimental data,
and (3) a feedback workflow to inform the model for rebuilding and
validation. The experimental data used for modeling should be
obtained using robust, high-throughput, analytical techniques
that allow for rapid identification and reliable quantification of
metabolites. In this context, metabolomic and lipidomic datasets
are predominantly generated by mass spectrometry (MS)-based
and nuclear magnetic resonance (NMR) approaches. A brief outline
of the methods is shown in Fig. 3.

3.1.1 MS-Based Lipidomic and Metabolomic Data

MS offers a sensitive, quantitative, technical solution and includes
the possibility of devising and coupling experiments to produce
structural information of countless metabolites in a single
acquisition. Considerations of data processing are as follows:


Fig. 3 Schematic representation of NMR- and MS-based metabolomics and lipidomics analysis providing data 
for model development. Included are major steps going from sample preparation, analytical methodologies, 
assignment, and data preprocessing 




1. Untargeted MS analyses provide an unbiased approach to 
simultaneously measure a large number of metabolites or lipids 
within a sample without prior knowledge of lipid and metabo- 
lite categories. Strengths are the broad coverage afforded by 
the high-resolution mass analyzers used to discriminate lipids 
based on mass to charge (m/z). Weaknesses lie in the complex- 
ity of the matrices analyzed such that high abundance metabo- 
lites are favored over low abundance ones despite multiple 
front-end separation approaches (i.e., gas chromatography, liq- 
uid chromatography, ion mobility, etc.). Quantification is done 
in a semiquantitative manner. Without reducing matrix com- 
plexity, the large quantity of metabolites and lipids results in 
ion suppression due to co-elution, as well as in detector satura- 
tion. These limitations are offset by the high-resolution mass 
scanning of the precursor ion which enables identification 
based on m/z. A comprehensive review of the technologies is 
provided in [41, 42]. 

2. Targeted MS analyses focus on a predefined set of metabolites 
and lipids by parking on a diagnostic ion using triple quadru- 
pole or QTRAP mass analyzers wherein the third quadrupole 
can be switched to trap fragmented ions for structural verifica- 
tion (reviewed in [41, 42]). By coupling chromatography to 
targeted MS methods, higher-resolution and more reliable 
quantification of metabolites can be achieved. In addition to GC
with derivatization and a variety of LC methods such as normal
phase, reversed phase, and hydrophilic interaction LC, ion pair
chromatography is another strategy commonly employed in
metabolomic analysis for the separation of ionic metabolites
[41, 42]. The targeted metabolomic and lipidomic pipelines
generally utilize tandem mass spectrometry to obtain high
selectivity, enhanced sensitivity, and reliable quantification of
metabolic targets by reducing noise from isobaric species. As
such, targeted MS analyses aim to perform close to absolute
quantification. This is achieved by performing tandem MS
experiments such as multiple reaction monitoring (MRM,
with or without scheduling) to restrict analysis to a predefined
set of metabolites or lipids. Data complexity is reduced by
quantifying a single lipid or metabolite subclass at a time (i.e.,
exploring 1000 in lieu of ~10,000 metabolites at a time).
The limitation is the number of analyses required to explore
the entire lipidome/metabolome. It is important to note that
data from both untargeted and targeted approaches comple- 
ment metabolomic modeling approaches. 


3. Post-acquisition data processing in both MS approaches
involves noise filtering and baseline correction, peak
detection/selection, adduct annotation and deisotoping, peak
alignment, and further deconvolution if necessary. Typically, in
untargeted MS analyses, due to the broad coverage of
metabolites, the mass spectrum and chromatogram are
saturated with noise signals. The removal of these noise signals
involves establishing a set threshold and subtracting this
threshold from the measurement. Similarly, this type of analysis
will likely also detect isotopic peaks of metabolites, which need
to be removed to simplify the final dataset. For both untargeted
and targeted MS analyses, specific parameters such as Gaussian
smoothing, peak splitting, acceptable peak width, and retention
time windows must be established for peak picking. This ensures
consistency in data analysis and avoids false-positive signals.
Finally, peak alignment is an important step in post-acquisition
data processing to obtain correct identity assignment for each MS
signal. Peak alignment and annotation are often performed using
multiple peak features, dependent on the separation methodology
employed. Several alignment programs and algorithms have been
developed for this purpose [43-47].

4. For post-acquisition normalization, the MS signal
corresponding to each monitored metabolite or lipid, whether
obtained in untargeted or targeted approaches, is normalized
against an internal standard, critically of the same class as the
analyte, and either expressed as pmol equivalents of this
standard or placed back onto a standard curve of a known,
normalized standard. Following this quantification from the sample
extract, the normalized MS signals need to be expressed
according to the amount of starting biological material (e.g.,
liquid volume, cell number, tissue wet weight, etc.).

3.1.2 NMR-Based Data

NMR can be used for nondestructive, continual, or in vivo
measurements in biofluids, tissues, and intact tissues and in solid,
semisolid, and gas phases, with a variety of different experiments
and instrument profiles and measurements of multiple different
nuclei (e.g., ¹H, ¹⁵N, ¹³C, ³¹P), separately or simultaneously. In
terms of metabolism modeling, NMR can provide longitudinal
measurement for a system by either continual sampling or in vivo
NMR measurement. Sample acquisition is limited, with NMR
experiments monitoring between 50 and 200 metabolites of high
abundance (with concentrations greater than 1 μM). Briefly, steps
in data derivation using NMR are as follows:


1. It is essential to select the appropriate experiment for the
system of interest. For fast, high-throughput, or continual
sample monitoring and quantification, 1D experiments with water
suppression (e.g., 1D NOESY or 1D CPMG), which require minimal
sample preprocessing (in the basic case only involving addition of
NMR reference material and pH buffer), are preferred, while 2D NMR
provides the possibility for analysis of complex systems with
unknown metabolites. Sample preparation for different applications
is reviewed in great detail elsewhere [48].

2. Data processing from any NMR experiment involves signal
processing (apodization, Fourier transform, phasing) and
normalization (relative to the NMR reference). The resulting
spectrum provides both peak positions (in ppm) that can be used for
assignment and peak intensities that are directly related to the
analytes’ concentrations. With the addition of an internal
reference, NMR can be used for absolute quantification of
metabolites in the sample and comparison between different samples
or time points.


3. Metabolite assignment is performed against reference standards
as described in [49-51], with a number of methods available for
different sample types. Important considerations are that peak
positions shift due to sample properties (i.e., pH, osmolality)
and that line widths change with the magnetic field strength,
sample viscosity, and composition, possibly leading to
changes in peak overlaps that can lead to errors in assignments.
Thus, assignment and quantification should be done using 
information for comparable systems with specific assignment 
and quantification methods available, for example, for human 
blood or cerebrospinal fluid [52]. Several general methods are 
available, but prior to their utilization, the user should adjust 
parameters for specific sample set (reviewed recently in [53]). 
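
As a worked illustration of the absolute quantification mentioned in step 2, the concentration of an assigned metabolite can be obtained from the ratio of per-nucleus peak areas of the analyte and the internal reference. The sketch below assumes fully relaxed, quantitative spectra and cleanly integrated, non-overlapping peaks; the function name and all numbers are hypothetical, and the reference (a nine-proton trimethylsilyl signal) and analyte (a three-proton methyl signal) are chosen only as an example.

```python
def nmr_concentration(i_analyte, n_analyte, i_ref, n_ref, c_ref):
    """Absolute concentration from integrated peak areas.

    i_*   : integrated peak area of the analyte / internal reference signal
    n_*   : number of equivalent nuclei contributing to each integrated peak
    c_ref : known concentration of the internal reference (sets the units)
    """
    return c_ref * (i_analyte / n_analyte) / (i_ref / n_ref)


# Illustrative values only: reference at 0.5 mM with a 9-proton signal,
# analyte quantified from a 3-proton methyl signal.
c_analyte = nmr_concentration(i_analyte=60.0, n_analyte=3,
                              i_ref=90.0, n_ref=9, c_ref=0.5)
print(c_analyte)  # concentration in mM, same units as c_ref
```

Applying the same function to spectra from consecutive time points yields the longitudinal concentration series needed for parameter optimization, provided the same reference concentration is used throughout.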


3.2 Prepare Omics Data for Further Model Development

A number of preprocessing steps are universally required for the
development of mechanistic models regardless of the modeling
approach and omics data collected. These include:


1. Data assignment and quantification. 


2. Using either novel data or information available in published
databases, high-quality and relevant longitudinal data are
required to build the model and optimize parameters. For
metabolism modeling, it is essential to have assigned and
quantified features measured for the specific biological system
under conditions of interest. Genomics, transcriptomics, and/or
proteomics should be used for contextualization of genome-scale
models, and metabolomics/lipidomics or flux data are used for
parameter determination in kinetic models or network optimization
in GEMs. Kinetic parameters are available for many
enzymatic reactions from ex vivo measurements (Table 1).


3. Missing data imputation: Due to biological or technical
reasons, some features will remain unidentified or unquantified.
Depending on the cause for missing data, analysts should follow
different strategies. Features with a large number of missing
values across conditions (of the order of 20-30% missing
values) should be excluded from further analysis. Features with
low abundance or undetected in specific samples where values
fall below levels of detection can be imputed with a value that is
a ratio of the lowest measurable value for the species (using 14
or 1% of the lowest measured value for that feature) or set to
0. Values missing due to experimental or technical errors can be
imputed using computational methods, calculating missing
values based on comparison with measured values in other
samples determined to be similar. Extensive benchmarking of
imputation methods has been presented recently [54], showing
that in the majority of tests, random forest-based imputation
provides an excellent approach for missing data estimates.

4. Data scaling from different experimental platforms. As a variety
of data sources can be used in the development of a metabolic
model, it is crucial to perform appropriate normalization for
each data type using either standard or internal references or
relative feature levels before combining data for model
building. The analyst must also decide if low and high abundance
analytes are placed on the same scale to ensure equal
representation. Methods have been discussed in great detail
previously [55, 56].

3.3 Develop a Mechanistic Model of Metabolic Processes of Interest

For the network of interest, first develop a set of ODEs or PDEs 
describing all reactions of interest in the model with appropriate 
dependencies and sink points in the format of Eq. 1. For large
systems, an exact solution is not possible, and generally one of
two approaches is applied: (1) make a quasi-steady-state
assumption and resolve to the genome-scale model (2.b), or (2) use
mathematical functions to describe the V(E, c, k) function, applying
available, measured, or estimated values for parameters (2.c):


1. For genome-scale model development, omics data provided for 
the system of interest (e.g., genomics, transcriptomics, proteo- 
mics, metabolomics, lipidomics) are used for the development 
of the personalized genome-scale FBA model. In particular, 
gene transcription and gene mutation information are 
integrated to develop contextualized genome-scale models 
where information about lack of function (through either 
mutation or gene knockdown) can be used directly to delete 
unrelated reactions. Methods for optimization of models are 
available in COBRA [57] or RAVEN [58]. Both tools operate 
in MATLAB or Python and provide a variety of different opti- 
mization routines for the development of contextualized mod- 
els and optimization of metabolic flux. Recon3D provides a 
complete known metabolic network [25, 57]. The COBRA 
platform allows for the addition of new reactions and features. 


2. For dynamic network reactions, thermodynamic information
can be obtained from existing databases (Table 1), ensuring that
the kinetic information is curated, up-to-date, and appropriate
for the species under investigation. The functional form of
V(E, c, k) can be approximated using the Michaelis-Menten
equation or other, more detailed formalisms and can possibly
include inhibition and activation interactions. It is critical to
ensure that the kinetic constants used match the model type and
the units of the metabolomic data.

3. FBA must be optimized for desired properties. This can be
achieved by maximizing, for example, biomass production or
cell growth using COBRA [57]. For dynamic models, kinetic
parameters can be optimized from available data for the system.
Optimization can be done using numerical methods or ML
methods (e.g., LSTM; see Method B).

4. Experimentally validate the FBA model by comparing predicted
individual metabolite levels with matched pairs of metabolites
measured in the metabolomic screen.

5. If stochastic aspects are significant for simulation, include
randomness, for example, by using the chemical Langevin
formulation or a Poisson mixture model (PMM) as recently
presented [59].

3.4 Integrate Mechanistic Model of Metabolic Processes with ML

1. Integrate in silico fluxomic and other omics data: Data
integration can be performed in three ways: (a) early integration,
concatenation of data into a unique dataset, (b) intermediate
integration, wherein the ML model is built using a combined
transformation of the separate input sets, and (c) late
integration, where a separate model is built for each dataset and
models are fused. Following integration, all data should be
scaled, for example, by z-score scaling (see 2c). In the
cross-validation process, training data should be normalized, and
the same normalization parameters should be used for the test set.
In the case of z-score normalization, the training set is
normalized, and the mean and standard deviation values of the
training set are used to normalize the test set in order to prevent
information leakage.

2. Develop ML architecture that allows analysis of integrated data:
A variety of methods are available and can be explored, with the
method proposed below resulting from [60]. Approaches for
fusing experimental results with knowledge-based in silico
models through interpretable ML are reviewed in [33].

3.5 Examples of Methods

1. Combination of data: Data-independent ensemble ML can be
used to combine all data (using the late integration approach;
see above), including omics as well as the predicted metabolic
data, each run by individual base learners. Subsequently, the
predictions and probabilities of prediction of the base learners
are combined in the meta-learner output with weights for each
predictor. The final probability of the result is
$p = \sum_i p_i w_i$, where $i$ is a base learner with probability
of prediction $p_i$ and weight $w_i$. Alternatively, fluxomic data
can be combined with other omics data and analyzed together using
ML (with early or intermediate data integration). The multimodal
artificial neural network (MANN) method has shown the best
performance for combined analysis of fluxomic and transcriptomic
data [33]; however, different combinations and sizes of data
require optimization of ML methods for any given application.


2. Optimization of hyperparameters for the model: The gradient
boosting machine (GBM) algorithm can be used with Bayesian
optimization for determining optimal hyperparameter values.
Bayesian optimization is run in multiple iterations, with fivefold
cross validation used to determine the performance of selected
hyperparameters. The weighted log loss is calculated as the
performance metric for GBM and also to determine model
performance on validation sets. The formula for weighted log loss
is:

$$\text{Weighted log loss} = \frac{1}{N_s}\sum_{i=1}^{N_s}\left[-\left(w_0\,y_i\log(p_i) + (1-y_i)\log(1-p_i)\right)\right] \qquad (2)$$

with $y_i$ the true class label of sample $i$, $p_i$ the predicted
probability of sample $i$ having the predicted label, $w_0$ the
weight for the given label, and $N_s$ the total number of samples.
Overfitting can be prevented by early stopping of the optimization
process. The mean weighted log loss with one standard error over
all five folds of cross validation is used to determine the best
hyperparameter set performance.


3. Test quality of ML model using cross validation: Data are split
into training, validation, and testing datasets. The training
set, usually a randomly selected 80% of the complete dataset, is
used for training the model with a user-defined set of
hyperparameters. The validation part of the data (usually the
remaining 20%) is used to assess model performance for the
set of hyperparameters optimized using the training set.


4. Test classifier performance for multiple iterations of randomized
training/validation and testing data split: Preferred performance
metrics are weighted log loss (Eq. 2), area under the
receiver operating characteristic curve (AUROC), as well as
measures comparing true positive (TP), false positive (FP), true
negative (TN), and false negative (FN) including:

$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (3)$$

$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (4)$$

$$\text{Balanced Accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right) \qquad (5)$$


5. Determine the importance of features in predictive models or
classification: Feature selection for both individual groups of
samples and across combined samples can be done by calculating
SHapley Additive exPlanations (SHAP) values for each
classifier [61].


3.6 Determination of Parameters for Mechanistic Models from Data- or Theory-Driven ML
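
Before moving on to parameter determination, the weighted combination of base learners and the classifier performance measures (Eqs. 2 to 5) can be checked numerically with a short script. This is a minimal sketch using only the Python standard library; the two base learners, their weights, the 0.5 decision threshold, and all probabilities are made-up illustrative values, and the function names are not from any package.

```python
import math


def ensemble_probability(probs, weights):
    """Late integration: p = sum_i p_i * w_i (weights assumed to sum to 1)."""
    return sum(p * w for p, w in zip(probs, weights))


def weighted_log_loss(y_true, p_pred, w_pos=1.0, eps=1e-12):
    """Eq. 2: mean over samples of -(w * y * log(p) + (1 - y) * log(1 - p))."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() stays finite
        total += -(w_pos * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def confusion_counts(y_true, y_hat):
    """Return (TP, FP, TN, FN) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_hat))
    tp = sum(1 for y, h in pairs if y == 1 and h == 1)
    fp = sum(1 for y, h in pairs if y == 0 and h == 1)
    tn = sum(1 for y, h in pairs if y == 0 and h == 0)
    fn = sum(1 for y, h in pairs if y == 1 and h == 0)
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    return tp / (tp + fn)  # Eq. 3


def specificity(tn, fp):
    return tn / (tn + fp)  # Eq. 4


def balanced_accuracy(tp, fp, tn, fn):
    return 0.5 * (sensitivity(tp, fn) + specificity(tn, fp))  # Eq. 5


# Illustrative data: true labels and probabilities from two base learners.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
p_omics = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.2]
p_flux = [0.8, 0.6, 0.3, 0.1, 0.4, 0.3, 0.2, 0.1]
weights = (0.6, 0.4)  # hypothetical meta-learner weights

p_final = [ensemble_probability(ps, weights) for ps in zip(p_omics, p_flux)]
y_hat = [1 if p >= 0.5 else 0 for p in p_final]
tp, fp, tn, fn = confusion_counts(y_true, y_hat)
loss = weighted_log_loss(y_true, p_final, w_pos=2.0)
ba = balanced_accuracy(tp, fp, tn, fn)
print(loss, sensitivity(tp, fn), specificity(tn, fp), ba)
```

In a cross-validation setting these functions would be applied to each validation fold, and the mean and standard error of the weighted log loss over the folds would then be compared between hyperparameter sets.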


1. Develop a kinetic or constrained metabolic model as listed in 3.3.

2. Generate combinations of input parameters randomly; if
information is available, constrain parameter values within the
allowed range. Parameters can include, for example, kinetic
constants, cell growth rate, cell motility, and media metabolite
concentrations. Model output values can include metabolite
concentration change over time, biomass information, and cell
density as calculated by the metabolic model.

3. Develop an LSTM architecture with an input layer, a fully
connected layer, LSTM arrays, and two output layers, one for
predicting peak values of distributions and one for predicting the
normalized distributions. Vanilla LSTM is available in MATLAB and
Python (TensorFlow or PyTorch). In the application of GPR
with prior information, architecture development requires
selection or generation of appropriate kernel functions, with
the possibility of additive kernel functions.

4. Perform input and output data preprocessing, including data
scaling with, for example, min-max scaling to get all data to the
0-1 range or z-score normalization.

5. Use the calculated molecular value distribution obtained in
3.6.2 with a random combination of parameters to train ML
models.

6. In the application of LSTM, parameters are used as input and
molecular values as output of the neural network model.
Randomly divide the data into training and test sets for
cross-validation assessment of model accuracy, or use
leave-one-out cross validation.

7. In LSTM, model input parameters are connected first to all
neurons in the fully connected layer. Select the activation
function (e.g., exponential linear unit), and initialize
connection weights randomly.

8. Optimize the network using, for example, cross entropy, and
calculate the cost function of the neural network using mean
squared error.

9. Evaluate the model using the test set with, for example,
calculation of root mean square error (RMSE) to determine the
difference between LSTM and mechanistic model results.

10. For prediction of new values, use the developed LSTM with new
parameter inputs, and for enhanced accuracy, use the ensemble
approach, for example, with Wisdom of the Crowd analysis. In
this approach, calculations are rerun with the same input, and
similarity scores are calculated between different predictions
using RMSE, R², or some other similarity assessment function.
Each prediction is evaluated with an assessment score relative
to the average prediction, and the result with the minimal score,
i.e., minimal deviation from the average score, is selected as the
final prediction result. 
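
The ensemble selection in step 10 can be sketched as follows. The example assumes that RMSE is used as the similarity function and that each rerun returns a list of predicted values of the same length; the three prediction vectors are invented for illustration and stand in for repeated runs of a trained network with the same input.

```python
import math


def rmse(a, b):
    """Root mean square error between two equally long prediction vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def wisdom_of_crowd(predictions):
    """Pick the rerun closest to the average prediction.

    Returns the index of the selected run, the score of every run,
    and the average prediction itself.
    """
    n = len(predictions)
    average = [sum(column) / n for column in zip(*predictions)]
    scores = [rmse(p, average) for p in predictions]
    best = min(range(n), key=scores.__getitem__)
    return best, scores, average


# Three hypothetical reruns predicting the same three output values.
runs = [
    [1.0, 2.0, 3.0],
    [1.2, 2.1, 2.9],
    [0.8, 1.9, 3.4],
]
best, scores, average = wisdom_of_crowd(runs)
print(best, scores)
```

Here the first run deviates least from the average and is returned as the final prediction; R² or another similarity measure could be substituted for `rmse` without changing the selection logic.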
A Machine Learning-Based Approach Using Multi-omics
Data to Predict Metabolic Pathways 


Vidya Niranjan, Akshay Uttarkar, Aakaanksha Kaul, 
and Maryanne Varghese 


Abstract 


Integrative approaches are continuously evolving to provide accurate insights from the data
obtained through experimentation on various biological systems. Multi-omics data can be integrated
with predictive machine learning algorithms in order to provide results with high accuracy. This protocol
chapter defines the steps required for the ML-multi-omics integration methods that are applied to
biological datasets for their analysis and for the visual interpretation of the results thus obtained.


Key words Multi-omics, Machine learning, Integration, Algorithms, Unsupervised learning, 
Supervised learning 


1 Introduction 


In response to the vast amounts of omics data generated from high- 
throughput technologies, many integrated approaches have been 
sought out to aid in their analyses and visualization. Despite the fact 
that omics studies like metabolomics, lipidomics, and glycomics are 
not included in the central dogma analysis [1], they nonetheless
provide a wealth of information about metabolites, lipids, and 
glycans (synthesized by the proteome via biosynthetic pathways) 
[2]. For example, because of its high sensitivity, high throughput, 
and unbiasedness, nontargeted metabolomics has attracted wide- 
spread attention as a method of profiling endogenous metabolites. 
As a result, metabolomics approaches have increasingly been 
applied to a variety of areas, including medication evaluation and 
monitoring [3]. By utilizing cutting-edge analytical technologies, 
metabolomics techniques are used to thoroughly analyze the
metabolite composition of biological materials. In recent years, the
liquid chromatography-mass spectrometry technological platform
has become the most widely utilized analytical tool for
metabolomics research [4]. Using tools like MetScape [5] and
Mummichog [6], mapping of metabolites to pathways can be achieved.
But the larger question that needs to be addressed is whether other
“omics” data, for example, proteomics data, can be integrated
with metabolomics to achieve better accuracy in mapping results.
For example, the integration of RNA-Seq and ChIP-Seq analyses of
data obtained from cell lines of head and neck squamous cell
carcinoma (HNSCC) recognized the association between
cancer-specific histone marks (H3K4me3 and H3K27ac) and
transcriptional changes observed in the driver genes of HNSCC:
epidermal growth factor receptor (EGFR), FGFR1, and FOXA1 [7].
Hence, multi-omics has proven advantageous over results obtained
from single-omics data. The use of a multi-omics approach has led
to the creation of a variety of tools, methods, and platforms for
multi-omics data analysis.

Machine learning (ML) methods in high-throughput multi-omics
analyses have been gaining popularity in the recent decade
due to the increased accessibility of high computing power. ML can
be used to interpret and visualize currently available data and to
create an algorithm that can predict the results for datasets that
may be studied in the future. It is used where there are challenges
in designing definitive algorithms for a particular problem set.
The use of deep learning, a subclass of machine learning, has
gained momentum in cases where complex operations are required to
read the dataset and formulate predictions (see Notes 1 and 2).

The superiority of integration approaches in grasping the
complexity of diverse diseases and identifying the underlying
anomalies from abundantly generated multi-omics data, which is not
always achievable with individual omics analysis, is highlighted by
a number of recent multi-omics studies (see Note 3).

2 Methods

The tools to be chosen for integrating a given multi-omics dataset
with machine learning depend on the type of data, its volume, the
integration method, and the expected outcome. The data can be
concatenated at an early stage, integrated at a late stage, or
integrated as a transformation. With each of these options,
supervised or unsupervised machine learning methods can be used
accordingly. An overview of the types of datasets that can be used
is provided (Fig. 1a); it covers all the possible types of
biological datasets in any combination (Table 1).
Fig. 1 A schematic representation of recommended algorithms for multi-omics data integration and analysis.
(a) Types of datasets that can be used, in any combination, for integration and analysis. (b) Scheme for
unsupervised ML methods: depending on whether the data call for early concatenation, late concatenation, or a
transformation model, the listed algorithms can be used. (c) Scheme for supervised learning, listing the
recommended algorithms that can be used for data integration, filtering, and clustering


Table 1
The list of tools recommended for predicting metabolic pathways from multi-omics data


Sl. no.  Genomics  Transcriptomics  Proteomics  Metabolomics  Tools to be used

1  Present  Present  Present  Autoencoders, elastic net, SVM, and consensus clustering
2  Present  Present  RF, PLS-DA, and extra trees
3  Present  Present  Graphical RF
4  Present  Present  SVM

Supervised learning is a machine learning approach that uses
labeled data to train models [8]. It is used to address problems
involving regression and classification. Unsupervised learning is
another machine learning approach that uses unlabeled input data
to discover patterns. It is used to tackle problems involving
association and clustering. A schematic representation of the
algorithms recommended for integration and data analysis with
supervised learning is shown in Fig. 1b. Similarly, if the available
data call for unsupervised learning, a schematic representation
(Fig. 1c) recommends a list of algorithms that can be used along
with data filtering and clustering.


2.1 Early Concatenation of Data


If the data can be concatenated at an early stage, the following
decision steps and tools apply to unsupervised and supervised
methods (see Notes 4 and 5).


Unsupervised: 


1. Check whether the multi-omics datasets overlap. If the overlap
is only partial, use MOFA (multi-omics factor analysis) [9].


2. If the overlap is complete, check whether the dataset is large
after integration. If yes, tools like moCluster [10], BN (Bayesian
network) [11], LRAcluster [12], iClusterBayes [13], and
even MOFA can be used.


3. If the dataset is not large after integration, check whether it
contains negative values. If yes, use iCluster [14].


4. If all values are positive, check whether the datasets follow
different distributions. If they do, use iCluster+ [15], JIVE
(Joint and Individual Variation Explained) [16], JBF (joint Bayes
factor) [17], or even the tools from points 1 and 2.


5. If the datasets follow similar distributions, use joint NMF
(non-negative matrix factorization) random forest [18].
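
The five checks above form a small decision tree, which can be sketched in code. The following is only an illustrative helper under stated assumptions: the function name, the argument names, and the shortened tool label "joint NMF" are hypothetical, and the caller is assumed to have already answered each check (overlap, size, sign, and distribution) for the dataset at hand.

```python
from typing import List

# Tools named in point 2 (MOFA comes from point 1 and is also allowed there).
LARGE_DATA_TOOLS = ["moCluster", "BN", "LRAcluster", "iClusterBayes", "MOFA"]


def recommend_unsupervised_early(
    overlap: str,
    large_after_integration: bool = False,
    has_negative_values: bool = False,
    different_distributions: bool = False,
) -> List[str]:
    """Walk through points 1-5 and return the suggested tools.

    `overlap` must be "partial" or "complete"; the flags are hypothetical
    names for the yes/no checks described in the text.
    """
    if overlap == "partial":                 # point 1
        return ["MOFA"]
    if overlap != "complete":
        raise ValueError("overlap must be 'partial' or 'complete'")
    if large_after_integration:              # point 2
        return list(LARGE_DATA_TOOLS)
    if has_negative_values:                  # point 3
        return ["iCluster"]
    if different_distributions:              # point 4
        return ["iCluster+", "JIVE", "JBF"] + LARGE_DATA_TOOLS
    return ["joint NMF"]                     # point 5


# Example: fully overlapping, small, non-negative data with mixed distributions.
suggestion = recommend_unsupervised_early("complete", different_distributions=True)
```

The helper only encodes the order of the checks; it does not inspect data or run any of the named tools.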


Supervised: 


1. Check whether a large dataset is produced after integration. If
yes, either embedded methods like LASSO (least absolute shrinkage
and selection operator) [19] can be employed, or filter methods
(like information gain) or wrapper methods (like RFE, recursive
feature elimination) can be used. The filter and wrapper methods
yield a reduced dataset, which can then be analyzed with tools like
DT (decision tree) [20], NB (naive Bayes) [21], SVM (support vector
machine) [22], KNN (k-nearest neighbors) [23], K-Star [24],
BT [25], SVR (support vector regression) [26], ANN (artificial
neural network) [27], and DNN (deep neural network) [28]
(see Notes 6 and 7).


2. If a smaller dataset is produced, the same tools that are used on
the reduced dataset in the previous point can be applied directly.
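
To make the idea of early concatenation followed by a filter step concrete, a minimal sketch is given below. It is not taken from any of the cited tools: the data layout (nested dictionaries keyed by sample ID), the function names, and the toy values are all assumptions, and information gain is computed with a single median split per feature, which is a simplification of how filter methods are usually implemented.

```python
import math
from collections import Counter
from typing import Dict, List

Omics = Dict[str, Dict[str, float]]  # sample ID -> {feature name -> value}


def concatenate_early(blocks: Dict[str, Omics]) -> Omics:
    """Join omics blocks feature-wise, keeping samples present in every block."""
    shared = set.intersection(*(set(block) for block in blocks.values()))
    merged: Omics = {}
    for sample in sorted(shared):
        row = {}
        for name, block in blocks.items():
            for feature, value in block[sample].items():
                row[f"{name}:{feature}"] = value  # prefix avoids name clashes
        merged[sample] = row
    return merged


def entropy(labels: List[str]) -> float:
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values: List[float], labels: List[str]) -> float:
    """Information gain of one feature, split at its median (a simplification)."""
    cut = sorted(values)[len(values) // 2]
    low = [y for v, y in zip(values, labels) if v < cut]
    high = [y for v, y in zip(values, labels) if v >= cut]
    remainder = sum(len(p) / len(labels) * entropy(p) for p in (low, high) if p)
    return entropy(labels) - remainder


def top_features(data: Omics, labels: Dict[str, str], k: int) -> List[str]:
    """Filter step: keep the k features with the highest information gain."""
    samples = sorted(data)
    y = [labels[s] for s in samples]
    features = sorted(data[samples[0]])
    gains = {f: information_gain([data[s][f] for s in samples], y) for f in features}
    return sorted(features, key=lambda f: -gains[f])[:k]


# Toy data (made up): sample s5 lacks transcriptomics and is dropped.
rna = {"s1": {"g1": 0.1, "g2": 5.0}, "s2": {"g1": 0.2, "g2": 1.0},
       "s3": {"g1": 0.9, "g2": 4.0}, "s4": {"g1": 0.8, "g2": 2.0}}
met = {"s1": {"m1": 3.0}, "s2": {"m1": 3.1}, "s3": {"m1": 7.0},
       "s4": {"m1": 7.2}, "s5": {"m1": 1.0}}
labels = {"s1": "ctrl", "s2": "ctrl", "s3": "case", "s4": "case"}

merged = concatenate_early({"rna": rna, "met": met})
selected = top_features(merged, labels, k=2)
```

The reduced feature set in `selected` would then be passed to one of the classifiers listed in point 1, such as an SVM or a decision tree.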


2.2 Concatenation of Data at a Later Stage


If the data can be integrated at a later stage:

Unsupervised:


1. Check how many omics datasets there are to integrate. If the
number is greater than 2, tools like FCA consensus clustering [29],
MDI (multiple dataset integration) [30], BCC (Bayesian consensus
clustering) [31], SNF (similarity network fusion) [32], PINS
(perturbation clustering for data integration and disease
subtyping), and PINS+ are used (see Note 8).


2. If the number is exactly 2, check whether one of the omics is
gene expression. If not, use the tools mentioned in point 1.


3. If yes, check whether the other omics is copy number variation
(CNV) data [33]. If yes, tools like PSDF (patient-specific
data fusion) [34], LemonTree [35], and CONEXIC [36]
are used. If not, LemonTree or the tools mentioned in
point 1 can be used.


Supervised:


Tools like majority-based voting [37], hierarchical classifiers
[38], ensemble-based methods (XGBoost [39] and KNN [40]), MOLI
(multi-omics late integration) [41], Hi-DFN Forest [42],
ATHENA (Analysis Tool for Heritable and Environmental Network
Associations) [43], and autoencoder-based classifiers can
be used.


2.3 Integration of Data as Transformation


If the dataset can be integrated as a transformation:
Unsupervised: 


1. Check whether the multi-omics datasets overlap. If the overlap
is partial, NEMO (neighborhood-based multi-omics clustering) [44]
can be used.


2. If the overlap is complete, tools like rMKL-LPP (regularized
multiple kernel learning for locality preserving projections) [45],
PAMOGK (pathway graph kernel-based multi-omics clustering
approach) [46], Meta-SVM [47], and NEMO are used.


Supervised:


Check whether it is a kernel- or graph-based transformation. If it
is a kernel-based transformation, tools like SDP-SVM [48], FSMKL
(multiple kernel learning for feature selection) [49], RVM
(relevance vector machine) [50, 51], AdaBoost RVM [52], and fMKL-DR
[53] are used.


If it is a graph-based transformation, tools like SSL
(semi-supervised learning) [54-58], graph sharpening [59, 60],
composite network [61], Bayesian network [62], and MORONET
(Multi-Omics gRaph cOnvolutional NETworks) [63] are used.
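
The difference between late integration (Subheading 2.2) and transformation-based integration (Subheading 2.3) can be illustrated with a short sketch. In late integration, a model is fitted per omics layer and only the predictions are combined, shown here as majority-based voting. In a kernel-based transformation, each layer is first turned into a sample-by-sample similarity matrix, and the matrices are then combined. The unweighted kernel average below is a deliberately simple stand-in and is not the method of any tool cited above, several of which learn how to weight the kernels. All function names and toy values are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

Matrix = List[List[float]]


def majority_vote(predictions: Dict[str, Sequence[str]]) -> List[str]:
    """Late integration: combine per-omics class predictions sample by sample.

    With an even number of voters, ties go to the label seen first.
    """
    per_sample = zip(*predictions.values())
    return [Counter(votes).most_common(1)[0][0] for votes in per_sample]


def rbf_kernel(rows: Matrix, gamma: float = 1.0) -> Matrix:
    """Similarity between every pair of samples within one omics block."""
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y))) for y in rows]
        for x in rows
    ]


def average_kernels(kernels: Sequence[Matrix]) -> Matrix:
    """Transformation-based integration: unweighted mean of the kernels."""
    n = len(kernels[0])
    return [
        [sum(k[i][j] for k in kernels) / len(kernels) for j in range(n)]
        for i in range(n)
    ]


# Late integration: three per-omics classifiers vote on three samples.
votes = majority_vote({
    "rna": ["high", "low", "low"],
    "methylation": ["high", "high", "low"],
    "mirna": ["low", "high", "low"],
})

# Transformation: samples 0 and 1 are similar in both blocks, sample 2 is not.
rna_rows = [[1.0, 2.0], [1.1, 2.1], [5.0, 0.5]]
met_rows = [[0.2], [0.3], [3.0]]
fused = average_kernels([rbf_kernel(rna_rows), rbf_kernel(met_rows)])
```

A fused matrix such as `fused` can serve as input to a kernel-based classifier or a clustering method, whereas `votes` is already the final prediction.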
3 Application


Tools to be used for analysis via integration of metabolomics
data with other types of commonly available or generated
data are consolidated in Table 1 (see Notes 9-13).


3.1 Supervised Learning


3.1.1 Test Case: Deep Learning-Based Multi-omics Integration [64]


Hepatocellular carcinoma (HCC) is the most common kind of liver
cancer (70-90%), and establishing robust survival subgroups would
improve patient care considerably. Little research takes into
account the high level of heterogeneity and integrates multi-omics
data to explicitly predict HCC survival across patient cohorts. To
close this gap, the study used 15,629 genes from RNA-Seq, 365 miRNAs
from miRNA-Seq, and 19,883 genes from DNA methylation data from The
Cancer Genome Atlas (TCGA) as input features for 360 samples to
build a DL-based, survival-sensitive model that predicts prognosis,
as well as an alternative model that considers both genomics and
clinical data. The study employs a total of six cohorts in which
patient survival subpopulations are reliably distinguished; their
descriptions are listed below:


• The TCGA data were used in two steps: the first step uses the
entire TCGA dataset to obtain the labels of survival risk
classes; the second trains a support vector machine (SVM) model
by splitting the samples 60/40 between training and held-out
testing data.


• To assess the prediction accuracy of the DL-based prognosis
model, the study used five additional confirmation datasets.


• The multi-omics HCC data were obtained from the TCGA portal.


• A biological feature was removed if it had no value in more than
20% of the patients.


• A sample was removed if more than 20% of its features were
missing.


• To fill in the missing values, the study used the impute
function from the R impute package [65].


• Input features with zero values across all samples were removed.
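
The three removal rules in the list above can be sketched as follows. This is an illustration only, not the study's code: the table layout (nested dictionaries with `None` for a missing value), the function name, and the toy values are assumptions, the rules are applied in the order in which they are listed, and the imputation step performed with the R impute package is not reproduced.

```python
from typing import Dict, Optional

Table = Dict[str, Dict[str, Optional[float]]]  # sample -> {feature -> value or None}


def filter_omics(table: Table, max_missing: float = 0.20) -> Table:
    """Drop sparse features, then sparse samples, then all-zero features."""
    samples = list(table)
    features = list(next(iter(table.values())))

    # Rule 1: drop features missing in more than `max_missing` of the samples.
    def missing_in_feature(f: str) -> float:
        return sum(table[s][f] is None for s in samples) / len(samples)

    features = [f for f in features if missing_in_feature(f) <= max_missing]

    # Rule 2: drop samples missing more than `max_missing` of the kept features.
    def missing_in_sample(s: str) -> float:
        return sum(table[s][f] is None for f in features) / len(features)

    samples = [s for s in samples if missing_in_sample(s) <= max_missing]

    # Rule 3: drop features with no non-zero value in the remaining samples.
    features = [
        f for f in features
        if any(table[s][f] not in (0, None) for s in samples)
    ]
    return {s: {f: table[s][f] for f in features} for s in samples}


# Toy table (made up): g2 is too sparse, p3 is too sparse, g3 is all zeros.
raw = {
    "p1": {"g1": 1.0, "g2": None, "g3": 0.0, "g4": 2.0},
    "p2": {"g1": 2.0, "g2": None, "g3": 0.0, "g4": 1.0},
    "p3": {"g1": None, "g2": 3.0, "g3": 0.0, "g4": 4.0},
    "p4": {"g1": 4.0, "g2": 1.0, "g3": 0.0, "g4": 3.0},
    "p5": {"g1": 5.0, "g2": 2.0, "g3": 0.0, "g4": 5.0},
}
clean = filter_omics(raw)
```

Any values that are still missing after filtering would be imputed before the features are passed to the model.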


The three types of omics features were stacked together and used
as input to an autoencoder, a DL framework [66]. Each of the 100
features produced by the autoencoder was subjected to univariate
Cox proportional hazards (Cox-PH) regression, and 37 of them were
found to be significantly associated with survival. K-means
clustering was applied to these 37 features, with the number of
clusters K ranging from 2 to 6. K = 2 was found to be the optimal
number of classes for the subsequent supervised machine learning
steps.
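
Scanning K from 2 to 6 can be sketched with a small, dependency-free K-means. The text above does not state which criterion was used to pick the optimal K, so the mean silhouette width used here is an assumption. The two-dimensional toy points stand in for the 37 survival-associated features, and the farthest-first initialization is a deterministic simplification of the usual random restarts. All names are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def dist(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_first(points: List[Point], k: int) -> List[List[float]]:
    """Deterministic start: take the first point, then repeatedly add the
    point farthest from the centers chosen so far."""
    centers = [list(points[0])]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(list(nxt))
    return centers


def kmeans(points: List[Point], k: int, max_iter: int = 100) -> List[int]:
    """Lloyd's algorithm; returns one cluster index per point."""
    centers = farthest_first(points, k)
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(points: List[Point], labels: List[int]) -> float:
    """Mean silhouette width; singleton clusters score 0 by convention."""
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == c) / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Two well-separated toy groups; K is scanned over the same range as in the text.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
```

On this toy input the scan selects two clusters, because every larger K splits one of the two compact groups and lowers the silhouette of its members.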


3.2 Unsupervised Learning


3.2.1 Test Case: Multidata Integration of Metabolomics and Transcriptomics to Reveal the Modulation Network of Cell Regulation [67]


Cells employ multiple levels of regulation, including transcriptional
and translational regulation, that drive core biological processes
and enable cells to respond to genetic and environmental changes.
Small-molecule metabolites are one category of critical cellular
intermediates that can influence, as well as be a target of,
cellular regulation. Because metabolites represent the direct output
of protein-mediated cellular processes, endogenous metabolite
concentrations can closely reflect cellular physiological states,
especially when integrated with other molecular-profiling data. In
this particular case study, a network reconstruction approach
simultaneously integrates six different types of data (endogenous
metabolite concentration, RNA expression, DNA variation, DNA-protein
binding, protein-metabolite interaction, and protein-protein
interaction data) to construct probabilistic causal networks that
elucidate the complexity of cell regulation in a segregating yeast
population.

Two classes of data were employed to reconstruct probabilistic
causal networks: (1) DNA variation, gene expression, and metabolite
data measured in the BXR cross (referred to here as BXR data)
and (2) protein-DNA binding, protein-protein interaction, and
metabolite-protein interaction data available from public data
sources and generated independently of the BXR cross (referred
to here as non-BXR data). The BXR data are reflected as nodes in
the network, where edges in the network reflect statistically inferred
causal relationships among the expression and metabolite traits.


4 Notes


1. The intricacy of multi-omics data processing necessitates
collaboration between the clinical and machine learning
communities, as well as the use of approaches from many fields. We
found that some promising methods, such as matrix factorization,
have not been widely used, whereas clustering and network-based
approaches have been widely used, owing to their flexibility and
ability to be integrated into comprehensive frameworks that include
feature extraction and transformation to overcome the curse of
dimensionality [68].


2. The impact of other types of noise and of filtering methods on
omics integration should be investigated in the future. Molecular
pathways [69], biomarkers [70], and sample categorization
[71] have all been discovered via multi-omics integration [15].


3. To reduce the complexity and heterogeneity of the datasets and
facilitate their subsequent integration and analysis, most
integration algorithms created in recent years first transform each
dataset individually using machine learning models; this is known
as mixed integration [8].


4. Although early and intermediate integration strategies solve
this problem by integrating all datasets ahead of time, the
large matrix generated by early integration is hard for most
ML models to exploit, and intermediate integration often
depends on unsupervised matrix factorization, which has difficulty
incorporating the substantial amount of preexisting
biological knowledge [8].


5. If the model is not designed for a specific purpose or for
specific multi-omics datasets, it may perform poorly
[72, 73]. Massive matrices, outliers, highly correlated variables,
noise, and other difficulties are exacerbated in multi-omics
investigations, and some models cannot handle them [74]. It is
also possible that the omics are not integrated properly [9].


6. As some omics will contain little or no useful information, the
complementarity of datasets and their relative pertinence
should be taken into account [10].


7. It is still challenging to translate DL model variable weights
into a context that domain specialists can understand [75].
Network mapping [11], in which statistical, functional, and
ML-based outputs are transferred onto network manifolds
(similarity, biochemical, and empirical), needs to be
adapted for the layered DL feature space.


8. Identifying causation in complex phenotypes currently requires
specialized analysis and interpretation by domain experts;
however, in the future, medical data accessibility, quality, and
scale may enable near-automated DL-based detection of many
clinically relevant events [12].


9. Current machine learning methods based on empirical risk
minimization (ERM) have some limitations, such as in identifying
the causal relationships between variables [76]. When minimizing
the empirical error, the learning algorithm seeks to absorb all of
the association links (e.g., correlations) identified in the data
[13]. To solve the association-versus-cause conundrum, the
invariant risk minimization (IRM) theoretical framework was
presented, which learns causation by inferring invariances across
conditions (e.g., different omics in a biological context) [14].


10. For machine learning applications, biological interpretability
remains a hurdle [77]. Previous work, for example, interpretable
deep neural network modeling [16], incorporated biological
knowledge about underlying mechanisms into the machine learning
model to address this [78].


11. When working on rare events, such as an uncommon attribute
in a population, class imbalance develops when the distribution 


5 Summary 
of classes in the learning data is skewed, which can be a serious